Outer Product Space http://www.outerproduct.space/feed.xml 2020-04-05T07:40:38+00:00
Expected Calibration Error 2020-01-20T00:00:00+00:00 http://www.outerproduct.space/2020/01/20/expected-caliberation-error
<p>ECE quantifies how much you can trust the class confidences your model gives. It is the difference between the predicted confidence and reality.</p> <h2 id="how-to-compute-the-caliberation-error">How to compute the Calibration Error?</h2> <p>Computing the calibration error is simple. First, run your model over a set of samples and collect all the predictions. We then need two quantities: the accuracy of the predictions and the average confidence of the predictions.</p> <p>The absolute difference between the two is the calibration error.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">compute_accuracy</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">targets</span><span class="p">):</span>
    <span class="k">assert</span> <span class="n">predictions</span><span class="o">.</span><span class="n">shape</span><span class="o">==</span><span class="n">targets</span><span class="o">.</span><span class="n">shape</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">predictions</span><span class="o">==</span><span class="n">targets</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">compute_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span> <span class="n">gt_idxs</span><span class="p">):</span>
    <span class="n">predicted_class_idxs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">acc</span> <span class="o">=</span> <span class="n">compute_accuracy</span><span class="p">(</span><span class="n">predicted_class_idxs</span><span class="p">,</span> <span class="n">gt_idxs</span><span class="p">)</span>
    <span class="n">conf</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">acc</span><span class="o">-</span><span class="n">conf</span><span class="p">),</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span>
</code></pre></div></div> <h2 id="expected-caliberation-error">Expected Calibration Error</h2> <p>A classifier is said to be well calibrated if it has a low ECE. I came across Expected Calibration Error in a recent, aptly titled paper “<a href="https://arxiv.org/abs/1912.03263"><em>Your classifier is secretly an energy based model and you should treat it like one</em></a>”. It defines ECE as</p> <script type="math/tex; mode=display">ECE = \sum _{m=1}^{M} \frac{|B_m|}{n} |acc(B_m) - conf(B_m)|</script> <p>To compute the ECE of a model, we first bin the predictions by their confidence into $M$ bins $B_m$, with $n$ the total number of predictions. 
Then, calculate the average of the per-bin calibration errors, weighting each bin by its share $\frac{|B_m|}{n}$ of the predictions.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">,</span> <span class="n">num_bins</span><span class="o">=</span><span class="mi">20</span><span class="p">):</span>
    <span class="n">delta</span> <span class="o">=</span> <span class="mf">1.0</span><span class="o">/</span><span class="n">num_bins</span>
    <span class="n">predicted_confidences</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">weights</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">,</span><span class="n">delta</span><span class="p">):</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">l</span><span class="o">+</span><span class="n">delta</span>
        <span class="c1"># bin the predictions</span>
        <span class="n">idxs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argwhere</span><span class="p">((</span><span class="n">predicted_confidences</span><span class="o">&lt;=</span><span class="n">h</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">predicted_confidences</span><span class="o">&gt;</span><span class="n">l</span><span class="p">))</span><span class="o">.</span><span class="n">flatten</span><span class="p">()</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">idxs</span><span class="p">)</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span><span class="k">continue</span>
        <span class="c1"># compute calibration error</span>
        <span class="n">ce</span><span class="p">,</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span> <span class="o">=</span> <span class="n">compute_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">[</span><span class="n">idxs</span><span class="p">,:],</span> <span class="n">gt_idxs</span><span class="p">[</span><span class="n">idxs</span><span class="p">])</span>
        <span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">l</span><span class="p">,</span><span class="n">h</span><span class="p">,</span><span class="n">ce</span><span class="p">,</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span><span class="p">])</span>
        <span class="n">weights</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">idxs</span><span class="p">)</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">predicted_confidences</span><span class="p">))</span>
    <span class="c1"># weight each bin's calibration error by its share of the samples (|B_m|/n)</span>
    <span class="n">ece</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">([</span><span class="n">w</span><span class="o">*</span><span class="n">ce</span> <span class="k">for</span> <span class="n">w</span><span class="p">,(</span><span class="n">_</span><span class="p">,</span><span class="n">_</span><span class="p">,</span><span class="n">ce</span><span class="p">,</span><span class="n">_</span><span class="p">,</span><span class="n">_</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span><span class="n">data</span><span class="p">)])</span>
    <span class="k">return</span> <span class="n">ece</span><span class="p">,</span> <span class="n">data</span>
</code></pre></div></div> <p>For testing, I fine-tuned a model with a resnet18 stem and computed predictions over the <a
href="https://github.com/fastai/imagenette">ImageWoof</a> dataset.</p> <p>Before training, the model has a very high ECE.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learner</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'stage-0'</span><span class="p">);</span>
<span class="n">probs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">get_preds</span><span class="p">()</span>
<span class="n">class_confidences</span><span class="p">,</span> <span class="n">gt_idxs</span> <span class="o">=</span> <span class="n">probs</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">targets</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">ece</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">)</span>
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_figure</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">delta</span><span class="p">)</span>
</code></pre></div></div> <p><img src="/images/2020-01-20-expected-caliberation-error_files/output_11_0.png" alt="png" /></p> <p>After training, the ECE has reduced considerably.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learner</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'stage-1'</span><span class="p">);</span>
<span class="n">probs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">get_preds</span><span class="p">()</span>
<span class="n">class_confidences</span><span class="p">,</span> <span class="n">gt_idxs</span> <span class="o">=</span> <span class="n">probs</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">targets</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">ece</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">)</span>
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_figure</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">delta</span><span class="p">)</span>
</code></pre></div></div> <p><img src="/images/2020-01-20-expected-caliberation-error_files/output_14_0.png" alt="png" /></p>
Gradient Through Concatenation 2019-07-23T00:00:00+00:00 http://www.outerproduct.space/2019/07/23/gradient-through-concatenation
<p>Concatenation of vectors is a common operation in the computational graphs of modern deep learning networks. This post describes how to compute the derivative of the output w.r.t. the inputs of the concatenation.</p> <script type="math/tex; mode=display">z = C(x,y)</script> <p>where $C$ is the concat operation. 
We are interested in computing $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$.</p> <p>Assuming $x\in \mathbb{R}^m$ and $y\in \mathbb{R}^n$, we can rewrite the concat operation as</p> <script type="math/tex; mode=display">z = \begin{bmatrix}I_m\\0\end{bmatrix}x+\begin{bmatrix}0\\I_n\end{bmatrix}y</script> <p>which implies</p> <script type="math/tex; mode=display">\frac{\partial z}{\partial x} = \begin{bmatrix}I_m\\0\end{bmatrix}</script> <script type="math/tex; mode=display">\frac{\partial z}{\partial y} = \begin{bmatrix}0\\ I_n\end{bmatrix}</script>
Gradient Through Addition with broadcasting 2018-09-18T00:00:00+00:00 http://www.outerproduct.space/2018/09/18/gradient-through-addition-with-bradcasting
<p>Calculating the gradient across an addition op is considered a simple algebra trick. But addition of tensors in real applications allows broadcasting. In this post, we examine how to compute the gradient even in such situations. Let us begin with the fundamentals.</p> <h2 id="gradient-of-addition">Gradient of Addition</h2> <p>Consider a simple sequence of operations. $A$ and $B$ are inputs which ultimately lead to the computation of a scalar loss/error term $l$.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} z &= A+B \\ l &\dashleftarrow z \end{aligned} %]]></script> <p>We are interested in the gradient of $l$ w.r.t. both the inputs. 
Let’s just go ahead and apply the chain rule to get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \frac{\partial l}{\partial A} &= \frac{\partial l}{\partial z} \\ \frac{\partial l}{\partial B} &= \frac{\partial l}{\partial z} \end{aligned} %]]></script> <h2 id="addition-in-neural-networks">Addition in Neural Networks</h2> <p>Let’s examine how addition in a feed forward layer is computed. This reveals how additions are generally done in real use cases. Consider a simple linear feed forward layer with $x$ as input. The transformation of the layer is given by</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} z &= W^Tx+b \\ l &\dashleftarrow z \end{aligned} %]]></script> <p>The gradients of the loss w.r.t. the parameters are</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \frac{\partial l}{\partial b} &=\frac{\partial l}{\partial z} \\ \frac{\partial l}{\partial W} &= x \frac{\partial l}{\partial z}^T \end{aligned} %]]></script> <p>$W^Tx$ is a vector with the same dimensions as $b$ if $x$ is also a vector. But in practice, networks are usually trained with mini-batches of samples at a time. This makes $x$ a 2-dimensional tensor, and suddenly the quantities $W^Tx$ and $b$ have different dimensions.</p> <h2 id="addition-with-broadcasting">Addition with broadcasting</h2> <p>An implicit assumption we make while adding two multi-dimensional quantities is that their dimensions always match. But numerical frameworks allow addition even when the dimensions of the operands are not the same. This is called addition with broadcasting.</p> <p>Addition is allowed if the arrays are broadcast compatible with each other. <a href="https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html">NumPy’s docs</a> describe two arrays as broadcast compatible if their dimensions are compatible. 
Two dimensions are compatible if</p> <ol> <li>they are equal, or</li> <li>one of them is 1.</li> </ol> <p>When the arrays do not have the same number of dimensions, they are compatible if the smaller array’s shape can be padded with dimensions of size 1 so that the dimensions of both arrays become compatible. NumPy pads on the left automatically; a trailing dimension of size 1 has to be added explicitly.</p> <h2 id="gradient-through-broadcasting">Gradient through Broadcasting</h2> <p>Let’s unwrap what actually happens during a broadcast operation. For simplicity, let’s say we are trying to add two tensors $A$ and $B$. $A$ and $B$ agree on all dimensions except the last, where $A$ has a dimension of size $3$ while $B$ has no such dimension.</p> <p>In this case, $B$ would be broadcast over $A$ to facilitate the addition as follows.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} A+B &=\begin{bmatrix} A_{:,0} & A_{:,1} & A_{:,2} \end{bmatrix}+B \\&= \begin{bmatrix} A_{:,0} & A_{:,1} & A_{:,2} \end{bmatrix}+\begin{bmatrix} B & B & B \end{bmatrix} \\&= \begin{bmatrix} A_{:,0}+B & A_{:,1}+B & A_{:,2}+B \end{bmatrix} \end{aligned} %]]></script> <p>This allows us to visualize the computations better. For the following computation,</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} z &= A+B \\ l &\dashleftarrow z \end{aligned} %]]></script> <p>we know that $\frac{\partial l}{\partial A}$ remains the same. But $B$ has been broadcast: every copy of $B$ contributes to $z$, so $\frac{\partial l}{\partial B}$ needs to be adjusted. The contributions of all the copies have to be accumulated over the broadcast dimension. Hence</p> <script type="math/tex; mode=display">\frac{\partial l}{\partial B} = \sum_{j=0}^{n-1}\frac{\partial l}{\partial z_{:,j}}</script> <p>where $n$ is the size of the broadcast dimension ($n=3$ above). When all the columns of $\frac{\partial l}{\partial z}$ are equal, this amounts to amplifying the gradient by the factor $n$. A curious thing to note here is that the gradient of the parameter $B$ has a dimension of $A$ in it and hence depends on the size of $A$. 
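</p> <p>A minimal NumPy sketch of this case, assuming a hypothetical loss $l = \sum z$ so that $\frac{\partial l}{\partial z}$ is a matrix of ones. The shapes and the helper name <code>broadcast_add_grads</code> are illustrative and not from the original post. $B$ is given an explicit trailing axis of size $1$ so that NumPy broadcasts it across the columns of $A$, and its gradient is accumulated over that broadcast axis.</p>

```python
import numpy as np

def broadcast_add_grads(A, B, dl_dz):
    """Gradients of l w.r.t. A and B for z = A + B[:, None] (illustrative helper)."""
    # A receives the upstream gradient unchanged.
    dl_dA = dl_dz
    # B was copied into every column of A, so its gradient
    # accumulates the upstream gradient over the column axis.
    dl_dB = dl_dz.sum(axis=1)
    return dl_dA, dl_dB

A = np.arange(12, dtype=float).reshape(4, 3)  # shape (4, 3)
B = np.arange(4, dtype=float)                 # shape (4,)
z = A + B[:, None]                            # B is broadcast over the 3 columns
dl_dz = np.ones_like(z)                       # gradient of the assumed loss l = z.sum()
dl_dA, dl_dB = broadcast_add_grads(A, B, dl_dz)
```

<p>With this particular loss, every entry of <code>dl_dB</code> equals the number of columns of $A$, which is the size of the broadcast dimension. 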
This might be one reason for the differing behaviour of neural networks when they are trained with different batch sizes.</p>
Softmax Classifier 2018-03-28T00:00:00+00:00 http://www.outerproduct.space/2018/03/28/softmax_classifier
<p>Imagine we have a dataset $\{x_i,y_i\}_{i=0}^m$ where $x$ is a data point and $y$ indicates the class $x$ belongs to. When deriving the LDA classifier, we modelled the class conditional density $P(x|y)$ as a Gaussian and derived the posterior probabilities $P(y|x)$. Here, we will directly model the posterior with a linear function. Since the posterior directly models what class a data point belongs to, there is not much left to do afterwards to get a classifier.</p> <p>But modelling $P(y|x)$ with only a linear projection $w^Tx$ has some problems. There is no easy way to restrict $w^Tx$ to always fall in $[0,1]$, nor to ensure that $\sum_k P(y=k|x) = 1$.</p> <p>We want a projection of the data such that it forms a valid probability distribution over the classes. Since this is nearly impossible with only a linear function, we stack a parameter-free transformation after the linear transformation.</p> <p><strong>Softmax</strong> is a vector valued function defined over a sequence $(z_k)$ as</p> <script type="math/tex; mode=display">\begin{aligned} \operatorname{softmax}(z)_k = \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]}\end{aligned}</script> <p>Softmax preserves the ordering of its inputs, i.e. a larger input coordinate gets a larger output value. 
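</p> <p>A tiny numeric sketch of this order-preserving behaviour, using NumPy and an arbitrary example input. The helper name <code>softmax</code> and the input values are illustrative assumptions. It applies the definition above directly, without any numerical safeguards.</p>

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, straight from the definition."""
    e = np.exp(z)
    return e / e.sum()

z = np.array([1.0, 3.0, 2.0])
s = softmax(z)
# The ranking of the outputs matches the ranking of the inputs.
same_order = np.array_equal(np.argsort(z), np.argsort(s))
```

<p>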
Softmax also squashes the values to lie in the range $[0,1]$ and makes their sum equal to $1$.</p> <script type="math/tex; mode=display">\begin{aligned} \sum_k \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = \frac{\sum_k \operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = 1\end{aligned}</script> <p>So our classifier is</p> <script type="math/tex; mode=display">\begin{aligned} P(y|x) = \operatorname{softmax}(w^Tx)\end{aligned}</script> <h1 id="derivative-of-the-softmax-function">Derivative of the Softmax function</h1> <p>We will need the derivative of the softmax function later on. So let’s figure out what it is. We can begin by writing softmax in a concise form.</p> <script type="math/tex; mode=display">\begin{aligned} s_k = \frac{e_k}{\Sigma}\end{aligned}</script> <p>where $e_k = \operatorname{exp}[z_k]$ and $\Sigma = \sum_j\operatorname{exp}[z_j]$. With $\frac{\partial e_k}{\partial z_k} = e_k$ and $\frac{\partial \Sigma}{\partial z_p} = e_p$, we can easily derive the derivative of the softmax function as follows.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \text{ when $p \neq k$} \\ \frac{\partial s_k}{\partial z_p} &= e_k\left[ \frac{-1}{\Sigma^2} e_p\right] \\&= -s_ks_p \\ \text{ when $p = k$} \\ \frac{\partial s_k}{\partial z_p} &= \frac{ e_k \Sigma- e_p e_k}{\Sigma^2} \\&= s_k-s_ps_k \\ \text{in general} \\ \frac{\partial s_k}{\partial z_p} &= s_k(\delta_{kp} - s_p)\end{aligned} %]]></script> <p>$\delta_{kp}$ is the Kronecker delta, which is $1$ only when $k=p$ and $0$ otherwise.</p> <h1 id="estimating-model-parameter-using-likelihood">Estimating Model Parameters using Likelihood</h1> <p>Now that we have a complete model of the classifier, $P(y|x) = \operatorname{softmax}(w^Tx)$, all that remains is to estimate the model’s parameters $w$ from the dataset. 
We can begin with the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">likelihood</a> of the model explaining the training data.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} L(w) &= \prod_x \prod_k P(k|x;w)^{y_k} \\ &= \prod_x \prod_k \operatorname{softmax}(w^Tx)_k^{y_k} \end{aligned} %]]></script> <p>Here $y_k$ is $1$ if $x$ belongs to class $k$ and $0$ otherwise. The likelihood gives a measure of how well the model explains the data $\{x,y\}$ for a given parameter $w$. To get the optimum value for the parameter, all we have to do is find the value of $w$ which maximises the likelihood.</p> <p>The likelihood function is a bit difficult to work with on its own. So we take the negative of the log of the likelihood function<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> so that all products get converted to sums and all exponents get converted to products. <script type="math/tex">% <![CDATA[ \begin{aligned} E(w) &= -\log L(w) \\ &= -\sum_x \sum_k y_k \log s_k\end{aligned} %]]></script></p> <p>All we have to do to get the best parameter is to minimise the negative log likelihood (thereby maximising the likelihood).</p> <script type="math/tex; mode=display">\begin{aligned} w_{opt} = \operatorname{argmin}_w E(w)\end{aligned}</script> <p><em>Note: If we think of $y$ as a probability distribution of the data belonging to each of the classes and $s$ as our model’s prediction of the same, then $E(w)$ is the cross entropy between the two distributions.</em></p> <p>For computing $w_{opt}$, we simply have to find the derivative of $E(w)$ w.r.t. $w$ and equate it to 0. But finding the derivative over the entire $w$ is difficult and non-intuitive. 
So let’s break it down and find the derivatives over each of the columns $w_p$ separately.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \nabla&_{w_p} E(w) \\ &= -\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial w_p} \\ &=-\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial z_p} \frac{\partial z_p}{\partial w_p} \\ &=-\sum_x \sum_k y_k \frac{1}{s_k} s_k(\delta_{kp} - s_p) x \\ &= -\sum_x \sum_k y_k (\delta_{kp} - s_p) x\end{aligned} %]]></script> <p>$\sum_k y_k (\delta_{kp} - s_p)$ can be expanded as $\sum_k y_k \delta_{kp} - s_p\sum_k y_k$. Since $\delta_{kp}$ is non-zero only for $k=p$ and $\sum_k y_k =1$, the term evaluates to $y_p - s_p$.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \nabla_{w_p} E(w) &= \sum_x (s_p - y_p) x \\ \nabla_{w} E(w) &= \sum_x (s - y) x\end{aligned} %]]></script> <p>Now we have a problem here. Setting $\nabla_w E(w)=0$ does not yield a closed-form solution for $w$. However, what the gradient does tell us is the direction in the space of $w$ in which to move (change $w$) so that the increase in $E$ is the largest. So if we move in the exact opposite direction (negative of the gradient), we will get the maximum reduction in $E$. Since $E$ measures the difference between the model’s predictions and the true labels, decreasing $E$ means our model is getting better. Enter <strong>Gradient Descent</strong>.</p> <h1 id="gradient-descent-algorithm">Gradient Descent Algorithm</h1> <p>Gradient descent is exactly what the name says: descent along the gradient. If you want to minimise a function, keep moving in the negative direction of its gradient (thereby descending).</p> <p><a href="https://en.wikipedia.org/wiki/Gradient_descent">Gradient descent</a> is an optimisation algorithm for cases where there is no analytic or easy solution for the parameters, but the gradients of the model can be computed at each point. 
The algorithm simply says that if $L$ is some loss function which measures how well the model does with parameter $\theta$ (lower is better), then we can update $\theta$ as follows to make the model better. <script type="math/tex">\begin{aligned} \theta_{new} = \theta - \alpha \nabla_\theta L\end{aligned}</script></p> <p>Now repeat the same with the new $\theta_{new}$ and the model will keep getting better. There are some caveats, however. The behaviour of gradient descent depends entirely on the choice of step size $\alpha$ and even then, convergence to the global optimum is not guaranteed.</p> <p>In practice, however, gradient descent performs well. We do have some tricks to pick the (seemingly) best step size and some other ways to ensure the model improves. In our case, the update step is simply</p> <script type="math/tex; mode=display">\begin{aligned} w_{new} = w - \alpha \nabla_{w}E(w)\end{aligned}</script> <p>To compute the true gradient direction of the model, we need to evaluate the model over every possible data point. This is impossible because (1) it requires an intractable amount of computation and (2) we don’t have labels for all possible data (if we did, what would be the point of building a classifier?).</p> <p>So, we compute an approximate model gradient over the data that we have. But even this is a lot of work for an approximate gradient, and that is not the end of the story. We have to keep iterating. Since we are approximating anyway, why not approximate further? Pick a sample at random, compute the gradient over it and update the model. This is Stochastic Gradient Descent.</p> <p><a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">Stochastic Gradient Descent</a> is not that stable. The gradients of individual samples do not agree with each other and hence the frequent updates simply result in random motion in the parameter space. To make it more stable, we compute the gradient over a small set of samples, or a batch. This is batched stochastic gradient descent. 
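</p> <p>A rough sketch of such a mini-batch loop for the softmax classifier, run on synthetic data. The function name <code>sgd_softmax</code>, the data and the hyper-parameter values are assumptions made for illustration, not the code behind this post. It uses the gradient $\sum_x (s-y)x$ derived above, averaged over the batch.</p>

```python
import numpy as np

def sgd_softmax(x, y, num_classes, alpha=0.1, batch_size=32, steps=500, seed=0):
    """Mini-batch SGD for a linear softmax classifier (illustrative sketch).

    x: (n, d) inputs, y: (n,) integer labels. Returns W of shape (num_classes, d).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = np.zeros((num_classes, d))
    Y = np.eye(num_classes)[y]                   # one-hot targets
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        x_b, y_b = x[idx], Y[idx]
        z = x_b @ W.T
        z -= z.max(axis=1, keepdims=True)        # shift so the exponentials cannot overflow
        s_b = np.exp(z)
        s_b /= s_b.sum(axis=1, keepdims=True)    # softmax predictions
        g_b = (s_b - y_b).T @ x_b / batch_size   # batch-averaged gradient of E
        W -= alpha * g_b                         # descent step
    return W

# Synthetic two-class data with well separated clusters.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
W = sgd_softmax(x, y, num_classes=2)
accuracy = np.mean(np.argmax(x @ W.T, axis=1) == y)
```

<p>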
Batched SGD updates the model more frequently than pure gradient descent and is more stable than vanilla SGD. In fact, batched SGD is so commonly used that it has now become the vanilla. SGD now refers to batched SGD.</p> <p>This is how you implement SGD.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} W &\gets initialise\_weights() \\ \text{repeat:} \quad x_b,y_b &\gets next(get\_batch)\\ g_b &\gets \nabla_W E(W; x_b, y_b) \\ W &\gets W - \alpha g_b \end{aligned} %]]></script> <h1 id="implementing-softmax-classifier-and-its-gradients">Implementing Softmax classifier and its gradients</h1> <p>Implementing the forward prediction of the classifier is pretty straightforward. First we have to do a matrix vector multiplication to implement $z=w^Tx$ and then point-wise exponentiate all the terms in $z$.</p> <p>However, we have to sum all of these exponentiated terms to get the denominator in the softmax step. Since exponentiation produces huge numbers when the components of $z$ are large, the sum can overflow and cause numerical errors.</p> <p><img src="figures/exponential-curve" alt="image" /></p> <p>To get rid of the numerical errors we use the following trick.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \operatorname{softmax}(z)_k &= \frac{e^{z_k}}{\sum_i e^{z_i}} \\ &= \frac{e^{-M}e^{z_k}}{e^{-M}\sum_i e^{z_i}} \\ &= \frac{e^{z_k-M}}{\sum_i e^{z_i-M}}\end{aligned} %]]></script> <p>If we choose $M$ large enough such that all the terms in the powers are negative or $0$, all the exponentiated terms will be at most $1$. So we set $M = \max z_i$.</p> <p>The following code sample shows how the model’s prediction is implemented. 
The code has been vectorised so that it can predict for a batch of $x$ at once.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def get_predictions(x, W):
    z = np.matmul(x, W.T)
    M = np.max(z, axis=-1, keepdims=True)
    e = np.exp(z - M)  # normalisation trick so that largest of z is 0
    sigma = e.sum(axis=-1, keepdims=True)
    s = e / sigma
    return s
</code></pre></div></div> <p>Unlike the forward pass, implementing the gradient is very simple. It’s only an outer product between two vectors $s-y$ and $x$. But when we implement it for a batch of samples and its predictions, the outer product can be implemented as a matrix multiplication. See the code sample below.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def get_batch_gradient(
        x_b,  # input
        y_b,  # target
        s_b   # prediction
    ):
    batch_size = x_b.shape[0]  # number of samples in the batch
    g_b = np.matmul((s_b - y_b).T, x_b)
    return g_b / batch_size
</code></pre></div></div> <h1 id="model-performance">Model Performance</h1> <p><a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a> is an image dataset with 10 image classes. In this section, we test our simple softmax classifier’s performance on it.</p> <p>To tune the model for optimum performance, we first need to find the best hyper-parameters (batch size and learning rate). For this, we first split the training set into two: a smaller set for validation and the rest to be used solely for training. The model is trained only on this new training set, and the validation set acts as our proxy test set while we find the best hyper-parameters.</p> <p>So we train the model for a relatively short duration (10000 gradient updates) and see its performance on the validation set for different choices of hyper-parameters. 
The following table lists the model’s performance for different hyper-parameter combinations (rows are learning rates, columns are batch sizes).</p> <hr /> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>learning rate \ batch size      4      8     16     32     64    128    256    512
$10^{-1}$                    0.20   0.16   0.28   0.31   0.26   0.29   0.22   0.28
$10^{-2}$                    0.22   0.29   0.35   0.38   0.38   0.39   0.40   0.40
$10^{-3}$                    0.33   0.34   0.36   0.36   0.37   0.37   0.37   0.37
$10^{-4}$                    0.29   0.30   0.30   0.30   0.31   0.30   0.30   0.30
$10^{-5}$                    0.18   0.19   0.18   0.19   0.19   0.19   0.19   0.17
</code></pre></div></div> <p>Batch size $256$ and learning rate $0.01$ gave the best performance. So we will train the model for a longer duration (1000000 gradient updates) with these parameters. The following plot shows the validation loss as training progresses.</p> <p><img src="figures/validation-loss" alt="image" /></p> <p>We get a final test accuracy of 38.05%.</p> <h1 id="conclusion">Conclusion</h1> <p>The LDA model gave us 37.85% accuracy on the CIFAR-10 dataset. The softmax classifier gives us 38.05% accuracy. It appears to be a close tie between the two models, but one important distinction is that LDA explicitly modelled the data as Gaussian while we made no such assumption while designing the softmax classifier.</p> <p>Our simple linear classifier appears useless when compared to bigger and more complex models (CNNs) that achieve near perfect accuracy on CIFAR-10. But there is some value in learning these simple ones first. They do teach some very valuable lessons about data modelling. They are also very good for testing implementations of optimiser algorithms like the SGD we have implemented for this post. 
Do test them out on some other problems.</p> <h1 id="code">Code</h1> <p>The code is <a href="https://github.com/nithishdivakar/blog-post-codes/tree/master/softmax-classifiers">here</a>.</p> <div class="footnotes"> <ol> <li id="fn:1"> <p>$\log$ is a strictly increasing function, so maximising the likelihood is the same as maximising its log <a href="#fnref:1" class="reversefootnote">&#8617;</a></p> </li> </ol> </div>
Differentiable Computations 2018-02-18T00:00:00+00:00 http://www.outerproduct.space/2018/02/18/differentiable-computations
<p>Auto gradient is a nice feature found in many computational frameworks. Specify the computation in the forward direction and the framework computes the backward gradients. Let’s talk about the generic method to do this.</p> <p>Let’s say we have to compute the result of ‘something’. It may be a nasty heat equation or some logic driven steps to get from input to output. Abstracting the steps involved gives us a sequence of equations <script type="math/tex">\begin{aligned} z_i = f_i(z_{a(i)})\end{aligned}</script></p> <p>The $z$’s are intermediate variables of the computation steps or they may be parameters. The selections $z_{a(i)}$ are the inputs to $f_i$.</p> <p><em>What does the gradient of this sequence of computations mean?</em></p> <p>If $z_n=f_n(z_{a(n)})$ is the final step of the computation, then the gradients of the sequence are the quantities $\frac{\partial z_n}{\partial z_i}$. 
Computing all those gradients tells us how the output changes with respect to each variable, parameters included.</p> <h1 id="handling-branches-and-loops">Handling Branches and loops</h1> <p>For any general computation to be included, we need to talk about branches and loops. How are these handled in our model?</p> <p>Conditional branches can be represented by indicator functions. See <a href="https://en.wikipedia.org/wiki/Indicator_function#Derivatives_of_the_indicator_function">this entry</a> for details on computing derivatives of indicator functions.</p> <p>Loops can be unrolled into a sequence of functions. All of them share the same parameters, but the input of each one is the output of the function representing the previous iteration. For example</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>begin loop 1:3
    x = x + c
end
</code></pre></div></div> <p>can be unrolled as</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} x_1 &= x + c \\x_2 &= x_1 + c \\x_3 &= x_2 + c\end{aligned} %]]></script> <p>This won’t work for infinite loops because the unrolling will never end. Infinite loops have no business in real-world computation. If a loop cannot be unrolled even after applying the “reality of the universe”, we are not talking about a computational system. It might be an event loop or a queue. Neither needs gradients!</p> <h1 id="forward-computation-as-a-constrained-optimisation-problem">Forward computation as a Constrained optimisation problem</h1> <p>Without loss of generality, we can say that all this hoopla of computing gradients is to minimise the final value. Even if this is not the case, for example if maximising the final result was the intent, we can append a negating function at the end of the sequence. 
There are many other techniques out there to convert different problems to a minimisation problem.</p> <p>Now that we have <em>that</em> out of the way, let’s look at the following problem.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \min~ & z_n \\ \text{s.t.}~ & z_i = f_i(z_{a(i)})\end{aligned} %]]></script> <p>The formulation is a little bit weird. All it says is: minimise $z_n$ such that the outputs of computations ($f_i$) are the inputs to other computations (all $f$’s which have $z_i$ as input). The constraints maintain the integrity of the sequence. So we managed to represent the same thing in two ways. Great!</p> <h1 id="how-do-you-solve-a-constrained-optimisation-problem">How do you solve a constrained optimisation problem?</h1> <p>Using the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>. It says that if we define the Lagrangian</p> <script type="math/tex; mode=display">\begin{aligned} L(z,\lambda) = z_n - \sum_i\lambda_i(z_i - f_i(z_{a(i)}))\end{aligned}</script> <p>then the gradient of $L$ with respect to all of its arguments vanishes at the optimum points of the original problem. So we get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \nabla_{\lambda_i}L=0 &\implies z_i = f_i(z_{a(i)}) \\ \nabla_{z_n}L=0 &\implies \lambda_n = 1 \\ \nabla_{z_i}L=0 &\implies \lambda_i = \sum_{k\in b(i)}\lambda_k \frac{\partial f_k}{\partial z_i}\end{aligned} %]]></script> <p>The final expressions for the $\lambda_i$’s give $\frac{\partial z_n}{\partial z_i}$ and hence all the gradients of our original computation. $b(\cdot)$ is like an inverse of $a(\cdot)$: $a(i)$ gives the $z$’s that are arguments of $f_i$, while $b(i)$ gives the $f$’s that have $z_i$ as an argument, i.e. $b(i) = \{k : i \in a(k)\}$. 
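</p> <p>The recurrence for $\lambda_i$ can be evaluated directly by sweeping the computation in reverse. The following is a minimal sketch, assuming a small scalar computation $z_3 = z_1 z_2$, $z_4 = \sin(z_3)$, $z_5 = z_4 + z_1$ chosen only for illustration. The names <code>steps</code>, <code>forward</code> and <code>reverse_sweep</code> are hypothetical and do not come from any framework.</p>

```python
import math

# Each step i maps to (a(i), f_i, partial derivatives of f_i w.r.t. its inputs).
# z_1 and z_2 are inputs; z_3..z_5 are computed. Illustrative example only.
steps = {
    3: ((1, 2), lambda u, v: u * v, lambda u, v: (v, u)),
    4: ((3,), lambda u: math.sin(u), lambda u: (math.cos(u),)),
    5: ((4, 1), lambda u, v: u + v, lambda u, v: (1.0, 1.0)),
}


def forward(inputs):
    """Run z_i = f_i(z_a(i)) in order and return every z value."""
    z = dict(inputs)
    for i in sorted(steps):
        args, f, _ = steps[i]
        z[i] = f(*(z[j] for j in args))
    return z


def reverse_sweep(z):
    """Accumulate lambda_i = sum over k in b(i) of lambda_k * df_k/dz_i."""
    n = max(z)
    lam = {i: 0.0 for i in z}
    lam[n] = 1.0  # lambda_n = 1
    # Visit steps last to first, so lam[k] is complete before it is used.
    for k in sorted(steps, reverse=True):
        args, _, df = steps[k]
        partials = df(*(z[j] for j in args))
        for j, d in zip(args, partials):
            lam[j] += lam[k] * d  # step k belongs to b(j)
    return lam


z = forward({1: 2.0, 2: 3.0})
lam = reverse_sweep(z)
```

<p>Here <code>lam[1]</code> and <code>lam[2]</code> hold $\frac{\partial z_5}{\partial z_1}$ and $\frac{\partial z_5}{\partial z_2}$. Note that $z_1$ feeds two steps, so its $\lambda$ collects two terms, exactly as the sum over $b(i)$ prescribes.</p> <p>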
Anyway, these equations fit nicely into a linear system</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} A\lambda = \begin{bmatrix} 0\\ \vdots\\ 0\\ -1 \end{bmatrix} \quad ; A_{i,k} = \begin{cases} \frac{\partial f_k}{\partial z_i} & k\in b(i) \\ -1 & k=i \\ 0 & \text{otherwise} \end{cases}\end{aligned} %]]></script> <p>Row $i$ of $A$ encodes the equation for $\lambda_i$. $A$ is an upper triangular matrix with $-1$’s on the diagonal, because every step $k\in b(i)$ comes after step $i$. Otherwise, we would be looking at a sequence of computations in which a step needs a result from the future. That is too complicated for now (example of explicit systems).</p> <p>This linear system of equations opens up a myriad of possibilities for computing gradients faster. The simplest of these is back substitution, since $A$ is triangular. If the computation we are dealing with is the forward pass of a neural network, what we get out of the back substitution is the “backprop” algorithm!!</p> <h1 id="deriving-backprop-in-a-weird-way">Deriving backprop, in a weird way</h1> <p>Let’s look at a very simple neural network</p> <p><script type="math/tex">% <![CDATA[ \begin{aligned} a_1 &= \sigma(x w_1) \\ a_2 &= \operatorname{softmax}(a_1 w_2) \\ l &= \operatorname{loss}(a_2,y)\end{aligned} %]]></script> If we simplify (ahem!) 
it according to our formulation, we get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} z_1&=x,~ z_2=y,~ z_3=w_1,~ z_4=w_2 \\z_5 &= z_1z_3 \\z_6 &= \sigma(z_5) \\z_7 &= z_6z_4 \\z_8 &= \operatorname{softmax}(z_7) \\z_9 &= \operatorname{loss}(z_8,z_2)\end{aligned} %]]></script> <p>This gives us the linear system</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \begin{bmatrix} -1 & & & & \frac{\partial f_{5}}{\partial z_{1}} & & & & \\ &-1 & & & & & & & \frac{\partial f_{9}}{\partial z_{2}} \\ & &-1 & & \frac{\partial f_{5}}{\partial z_{3}} & & & & \\ & & &-1 & & & \frac{\partial f_{7}}{\partial z_{4}} & & \\ & & & &-1 & \frac{\partial f_{6}}{\partial z_{5}} & & & \\ & & & & &-1 & \frac{\partial f_{7}}{\partial z_{6}} & & \\ & & & & & &-1 & \frac{\partial f_{8}}{\partial z_{7}} & \\ & & & & & & &-1 & \frac{\partial f_{9}}{\partial z_{8}} \\ & & & & & & & & -1 \end{bmatrix} \begin{bmatrix} \lambda_{1}\\ \lambda_{2}\\ \lambda_{3}\\ \lambda_{4}\\ \lambda_{5}\\ \lambda_{6}\\ \lambda_{7}\\ \lambda_{8}\\ \lambda_{9}\\ \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ -1 \end{bmatrix}\end{aligned} %]]></script> <p>Apply back substitution, starting from $\lambda_9 = 1$, and we get</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \lambda_3 &= \lambda_5 \frac{\partial f_5}{\partial z_3}\quad \lambda_4 = \lambda_7 \frac{\partial f_7}{\partial z_4}\\ \lambda_5 &= \lambda_6 \frac{\partial f_6}{\partial z_5}\quad \lambda_6 = \lambda_7 \frac{\partial f_7}{\partial z_6}\\ \lambda_7 &= \lambda_8 \frac{\partial f_8}{\partial z_7}\quad \lambda_8 = \lambda_9 \frac{\partial f_9}{\partial z_8}\\ \lambda_3 &= \frac{\partial l}{\partial z_8} \frac{\partial z_8}{\partial z_7} \frac{\partial z_7}{\partial z_6} \frac{\partial z_6}{\partial z_5} \frac{\partial z_5}{\partial w_1}\\ \lambda_4 &= \frac{\partial l}{\partial z_8} \frac{\partial z_8}{\partial z_7} \frac{\partial z_7}{\partial w_2}\end{aligned} %]]></script> <p>and there it is!! 
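</p> <p>The same back substitution can be checked numerically. The sketch below is only an illustration under stated assumptions: all variables are scalars, a sigmoid stands in for the softmax, and a squared error stands in for the loss, none of which the post specifies. The helper names <code>build_system</code> and <code>back_substitute</code> are hypothetical.</p>

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def forward(x, y, w1, w2):
    """Scalar stand-in for the network; returns z_1..z_9 (index 0 unused)."""
    z = np.zeros(10)
    z[1], z[2], z[3], z[4] = x, y, w1, w2
    z[5] = z[1] * z[3]
    z[6] = sigmoid(z[5])
    z[7] = z[6] * z[4]
    z[8] = sigmoid(z[7])             # assumption: sigmoid in place of softmax
    z[9] = 0.5 * (z[8] - z[2]) ** 2  # assumption: squared error as the loss
    return z


def build_system(z):
    """A[i, k] = df_k/dz_i for k in b(i), with -1 on the diagonal."""
    A = -np.eye(10)  # built with 1-based indices, row and column 0 dropped below
    A[1, 5], A[3, 5] = z[3], z[1]
    A[5, 6] = z[6] * (1.0 - z[6])
    A[6, 7], A[4, 7] = z[4], z[6]
    A[7, 8] = z[8] * (1.0 - z[8])
    A[8, 9], A[2, 9] = z[8] - z[2], z[2] - z[8]
    rhs = np.zeros(10)
    rhs[9] = -1.0
    return A[1:, 1:], rhs[1:]


def back_substitute(A, rhs):
    """Solve A lam = rhs for an upper triangular A, last row first."""
    n = len(rhs)
    lam = np.zeros(n)
    for i in range(n - 1, -1, -1):
        lam[i] = (rhs[i] - A[i, i + 1:] @ lam[i + 1:]) / A[i, i]
    return lam


z = forward(x=0.7, y=1.0, w1=-0.3, w2=1.5)
lam = back_substitute(*build_system(z))
grad_w1, grad_w2 = lam[2], lam[3]  # lambda_3 and lambda_4 (0-based array)
```

<p>A finite-difference estimate of the loss derivative with respect to <code>w1</code> and <code>w2</code> should agree with <code>grad_w1</code> and <code>grad_w2</code>, which is a cheap way to validate the matrix entries.</p> <p>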
$\lambda_3$ is the gradient for parameter $w_1$ and $\lambda_4$ is the gradient for $w_2$.</p> <p>Now the structure of the matrix $A$ for this problem isn’t that interesting. The example network is very simple. Almost too simple. The computational graph is almost a line graph. But in more interesting cases, for example the <a href="https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf">inception</a> architecture, the matrix will have a very nice structure. A particular example is the dense block from <a href="https://arxiv.org/abs/1608.06993">DenseNet</a>, where the matrix will have a fully filled upper triangle.</p> <p><strong>Attribution</strong> I had my first encounter with the constrained optimisation view of computation in Yann LeCun’s <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-88.pdf">1988 paper</a> “A Theoretical Framework for Back-Propagation”. Incidentally, this is the first paper I understood about deep learning and related fields. Do give it a read.</p>Auto gradient is a nice feature found in many computational frameworks. Specify the computation in forward direction and the framework computes backward gradients. Let’s talk about the generic method to do this.Linear Classifiers2018-02-18T00:00:00+00:002018-02-18T00:00:00+00:00http://www.outerproduct.space/2018/02/18/linear_classifiers<p>In the <a href="/2018/02/11/bayes-error">post on Bayes error</a>, we discussed what the best classifier is if the features are not enough to tell the classes apart. We also derived that in such a situation, the best classifier is</p> <script type="math/tex; mode=display">\begin{aligned} h(x) = \operatorname{sign} \left( n(x) - \frac{1}{2} \right) \end{aligned}</script> <p>This formulation cannot be used in general situations as there is no easy way to estimate $n(x) = P(y=+1|x)$ for a general distribution. 
But what if $x$ has a simple distribution?</p> <p>Let’s assume that the data is Gaussian for each class</p> <script type="math/tex; mode=display">\begin{aligned} P(x|y) = N(\mu_y,\Sigma_y) = f_y\end{aligned}</script> <p>The parametric form of $P(x|y)$ immediately gives us a closed form for $n(x)$ by a simple application of Bayes rule, with $p = P(y=+1)$ the prior probability of the positive class</p> <script type="math/tex; mode=display">\begin{aligned} n(x) = \frac{pf_{+1}}{p f_{+1}+(1-p)f_{-1}}\end{aligned}</script> <p>which in turn gives us a simple classifier</p> <script type="math/tex; mode=display">\begin{aligned} n(x) - \frac{1}{2} = \frac{pf_{+1}}{pf_{+1}+(1-p)f_{-1}} - \frac{1}{2} = \frac{pf_{+1} - (1-p)f_{-1}}{2\left(pf_{+1}+(1-p)f_{-1}\right)} \end{aligned}</script> <p>The denominator is positive, so only the numerator decides the sign. Dividing it by $pf_{-1}$, which is also positive, gives</p> <script type="math/tex; mode=display">\begin{aligned} \operatorname{sign} \left( n(x) - \frac{1}{2} \right) = \operatorname{sign} \left( \frac{f_{+1}}{f_{-1}} - \frac{1-p}{p} \right) \end{aligned}</script> <p>To further simplify, we use the strictly increasing property of the $\log$ function and write</p> <script type="math/tex; mode=display">\begin{aligned} h(x) = \operatorname{sign} \left( \log \frac{f_{+1}}{f_{-1}}- \log\frac{1-p}{p} \right)\end{aligned}</script> <p>This gives us a simpler form of the classifier</p> <script type="math/tex; mode=display">\begin{aligned} h(x) = \operatorname{sign}(x^TAx + b^Tx+c)\end{aligned}</script> <p>where $A = \Sigma_{-1}^{-1}-\Sigma_{+1}^{-1}$, up to a positive scale factor that does not change the sign.</p> <p>If we further assume that the class covariances are the same ($\Sigma_{-1}=\Sigma_{+1}$), then $A$ vanishes and what we get is a linear classifier. <script type="math/tex">\begin{aligned} h(x) = \operatorname{sign}(b^Tx+c)\end{aligned}</script></p> <h1 id="lda-for-multi-class-classification">LDA for multi-class classification</h1> <p><a href="/2018/02/11/bayes-error">Bayes error</a> tells us that in the general case, the classifier that has the least error is</p> <script type="math/tex; mode=display">\begin{aligned} h(x) = \operatorname{arg max}_y n_y(x) \end{aligned}</script> <p>where $n_y(x) = P(y|x)$ is the posterior probability of class $y$. In general, the data need not follow any simple distribution and hence these posteriors need not have a closed form. 
To mitigate this, we first assume that the data does follow a parametric distribution.</p> <p>We will assume that the class conditional densities $p(x|y)$ are Gaussian. This means that each class of the data is centred around some point in the data space (the class-wise mean), and the density of the data belonging to this class decreases as we go further away from this mean. We will further assume that all these class-wise distributions have the same covariance. Although this assumption is restrictive, it helps in keeping our classifier simple. Deriving a variant of the classifier which accommodates different covariances is fairly straightforward from the following steps. Thus, we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} p(x|y) & = f_y(x) = N(\mu_y, \Sigma) \\ \mu_y &= \frac{1}{m_y} \sum_{x\in class[y]}x \\ \Sigma &= \frac{1}{m} \sum_{(x,y)} (x-\mu_y)(x-\mu_y)^T\end{aligned} %]]></script> <p>Here $m_y$ is the number of training points in class $y$ and $m$ is the total number of training points. With these analytic forms, it is easy to get a closed form for $n_y$ by a straightforward application of Bayes rule with the class prior $p_y=P(y)$, and we get</p> <script type="math/tex; mode=display">\begin{aligned} n_y(x) = \frac{p_y f_y(x)}{\sum_{y'}p_{y'} f_{y'}(x)}\end{aligned}</script> <p>From here, getting our classifier is only a matter of simplifying the equations. The denominator is the same for every class, so we have</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} h(x) &= \operatorname{arg max}_y~ p_yf_y(x) \\ &= \operatorname{arg max}_y~ w_y^Tx + b_y\end{aligned} %]]></script> <p>where $w_y = \Sigma^{-1}\mu_y$ and $b_y = \log p_y - \frac{1}{2} \mu_y^{T}\Sigma^{-1}\mu_y$. See <strong>Appendix 1</strong> for the full derivation.</p> <h1 id="how-to-build-one">How to build one</h1> <p>To make the classifier, we first need to estimate the class-wise mean and common covariance from the training data. 
Computing the class-wise mean (and the class prior) can be done simply by</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mu[y] = np.mean(X[Y==y],axis=0)
p[y] = np.mean(Y==y)  # class prior, needed for b[y] below
</code></pre></div></div> <p>The common covariance can then be calculated as</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = np.empty_like(X, dtype=float)
for y in cls:
    M[Y==y,:] = mu[y]
S = np.cov((X-M).T, bias=True)  # bias=True normalises by m, as in the formula
S_inv = np.linalg.pinv(S)
</code></pre></div></div> <p>The classifier parameters can then be computed as</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for y in cls:
    w[y] = S_inv.dot(mu[y])
    b[y] = np.log(p[y]) - 0.5* mu[y].T.dot(S_inv).dot(mu[y])
</code></pre></div></div> <p>Predicting the class of new data is simply</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># assumes the class labels are the integers 0..K-1
W = np.zeros((X.shape[1], len(w)))
B = np.zeros(len(b))
for y in w:
    W[:,y] = w[y]
    B[y] = b[y]
pred = np.argmax(X.dot(W)+B,axis=1)
</code></pre></div></div> <p>That is it! A complete classifier in 20 lines of code. See <strong>Appendix 2</strong> for the full code.</p> <h1 id="how-good-are-they">How good are they</h1> <p>Here are the accuracies I got on different datasets using our classifier. All accuracies are computed on the test sets of the corresponding datasets, which are not used in computing the parameters.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>* 83.90% for MNIST
* 76.51% for Fashion-MNIST
* 37.85% for cifar 10
* 16.67% for cifar 100
</code></pre></div></div> <p>The classifier does well on MNIST and Fashion-MNIST, but not so well on either of the CIFARs. None of these accuracies is close to the state of the art, which is in the high 90s for both MNISTs and CIFAR-10 and the high 70s for CIFAR-100 (<a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">link</a>). 
Regardless, these are good baselines considering how little computation and effort is required to build them.</p> <h1 id="appendix-0">Appendix 0</h1> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \log \frac{f_{+1}}{f_{-1}} &= \frac{1}{2} \log \frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} \\&- \frac{1}{2} \left[ (x-\mu_{+1})^T\Sigma_{+1}^{-1}(x-\mu_{+1}) - (x-\mu_{-1})^T\Sigma_{-1}^{-1}(x-\mu_{-1}) \right] \end{aligned} %]]></script> <h1 id="appendix-1">Appendix 1</h1> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} h(x) &= \operatorname{arg max}_y \left( p_yf_y(x) \right) \\ &= \operatorname{arg max}_y \left( \log p_y + \log f_y(x) \right) \\ &= \operatorname{arg max}_y \left( \log p_y -\frac{1}{2} \log |\Sigma| - \frac{x^T\Sigma^{-1}x}{2} - \frac{\mu_y^T\Sigma^{-1}\mu_y}{2} + \frac{2(\mu_y^T\Sigma^{-1})x}{2} \right) \\ &= \operatorname{arg max}_y \left( \log p_y - \frac{\mu_y^T\Sigma^{-1}\mu_y}{2} + (\mu_y^T\Sigma^{-1})x \right)\end{aligned} %]]></script> <p>The last step drops the terms that do not depend on $y$, since they do not affect the arg max.</p> <h1 id="appendix-2">Appendix 2</h1> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def get_linear_classifier(X,Y,cls):
    mu = {}
    p = {}
    w = {}
    b = {}
    # class-wise mean and probabilities
    for c in cls:
        mu[c] = np.mean(X[Y==c],axis=0)
        p[c] = (Y==c).sum()/X.shape[0]
    # common covariance matrix and its inverse
    M = np.empty_like(X, dtype=float)
    for c in cls:
        M[Y==c,:] = mu[c]
    # bias=True normalises by the number of samples m, as in the formula
    S = np.cov((X-M).T, bias=True)
    S_inv = np.linalg.pinv(S)
    # classifier parameters
    for c in cls:
        w[c] = S_inv.dot(mu[c])
        b[c] = np.log(p[c]) - 0.5* mu[c].T.dot(S_inv).dot(mu[c])
    return w,b

def test_model(w,b,X,Y):
    # assumes the class labels are the integers 0..K-1
    W = np.zeros((X.shape[1],len(w)))
    B = np.zeros((len(b),))
    for c in w:
        W[:,c] = w[c]
        B[c] = b[c]
    pred = np.argmax(X.dot(W)+B,axis=1)
    acc = np.mean(pred==Y)
    return acc
</code></pre></div></div>In the post on bayes error, we discussed what is the best classifier if the features are not enough to tell the class apart. 
We also derived that in such situation, the best classifier isBayes Error2018-02-11T00:00:00+00:002018-02-11T00:00:00+00:00http://www.outerproduct.space/2018/02/11/bayes_error<p>In an ideal world, everything has a reason. Every question has an unambiguous answer. The data is sufficient to explain its behaviour, like the class it belongs to.</p> <script type="math/tex; mode=display">\begin{aligned} g(x) = y \end{aligned}</script> <p>In the non-ideal world, however, there is always something missing that stops us from knowing the entire truth. $g$ is beyond reach. In such cases we resort to probability.</p> <script type="math/tex; mode=display">\begin{aligned} n(x) = P(y=+1|x)\end{aligned}</script> <p>It simply tells us how probable it is that the data belongs to a class ($y=+1$) given that our observation is $x$.</p> <p><em>If we build a classifier on this data, how good will it be?</em> This is the question Bayes error answers.</p> <h1 id="bayes-error">Bayes Error</h1> <p>Let’s say I’ve built a classifier $h$ to predict the class of data. $h(x)=\hat{y}$ is the predicted class and $y$ is the true class. 
Even ambiguous data needs to come from somewhere, so we assume $D$ is the joint distribution of $x$ and $y$. The error of $h$ is the probability that its prediction differs from the true class.</p> <script type="math/tex; mode=display">\begin{aligned} er_D[h] = P_D[h(x) \neq y]\end{aligned}</script> <p>Using an old trick to convert probability to expectation, $P[A] = E[1(A)]$, we have</p> <script type="math/tex; mode=display">\begin{aligned} er_D[h] = E_{x,y}[1(h(x)\neq y)] = E_x E_{y|x}[1(h(x)\neq y)]\end{aligned}</script> <p>The inner expectation is easier to handle when expanded.</p> <script type="math/tex; mode=display">\begin{aligned} E_{y|x}[1(h(x)\neq y)] = 1(h(x)\neq +1) P(y=+1|x) + 1(h(x)\neq -1)P(y=-1|x)\end{aligned}</script> <p>which gives the final error</p> <script type="math/tex; mode=display">\begin{aligned} er_D[h] = E_x[1(h(x)\neq +1) n(x) + 1(h(x)\neq -1)(1-n(x))]\end{aligned}</script> <p>The last equation means that if the classifier predicts $+1$ for the data, only the second indicator is active and it contributes $1-n(x)$ to the error, the probability that the true class is $-1$. On the other hand, if it predicts $-1$, the contribution is $n(x)$.</p> <p>The best classifier therefore predicts $+1$ when $n(x)$ is large ($n(x) \geq \frac{1}{2}$) and $-1$ when $n(x)$ is small, so that it always pays the smaller of the two contributions. The minimum achievable error is then</p> <script type="math/tex; mode=display">\begin{aligned} er_D = E_x [\min(n(x),1-n(x))]\end{aligned}</script> <p>This error is called the <strong>Bayes Error</strong>.</p> <h1 id="references">References</h1> <p><a href="http://drona.csa.iisc.ernet.in/~e0270/Jan-2015/">Shivani Agarwal’s lectures</a></p>In an ideal world, everything has a reason. Every question has an unambiguous answer. The data is sufficient to explain its behaviour, like the class it belongs to.