<h1>Expected Calibration Error</h1>
<p><em>2020-01-20 · http://www.outerproduct.space/2020/01/20/expectedcaliberationerror</em></p>
<p>ECE quantifies how much you can trust the class confidences your model gives. It is the difference between the predicted confidence and reality.</p>
<h2 id="howtocomputethecaliberationerror">How to compute the Calibration Error?</h2>
<p>Computing the calibration error is simple. First, run your model over a set of samples and collect all the predictions. We need to compute two things: the accuracy of the predictions, and the average confidence of the predictions.</p>
<p>The absolute difference between the two is the calibration error.</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compute_accuracy</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">targets</span><span class="p">):</span>
    <span class="k">assert</span> <span class="n">predictions</span><span class="o">.</span><span class="n">shape</span><span class="o">==</span><span class="n">targets</span><span class="o">.</span><span class="n">shape</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">predictions</span><span class="o">==</span><span class="n">targets</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">compute_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span> <span class="n">gt_idxs</span><span class="p">):</span>
    <span class="n">predicted_class_idxs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">acc</span> <span class="o">=</span> <span class="n">compute_accuracy</span><span class="p">(</span><span class="n">predicted_class_idxs</span><span class="p">,</span> <span class="n">gt_idxs</span><span class="p">)</span>
    <span class="n">conf</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">acc</span><span class="o">-</span><span class="n">conf</span><span class="p">),</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span>
</code></pre></div></div>
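<p>As a quick sanity check, consider a hypothetical toy model (the numbers below are made up for illustration; the function is restated with corrected spelling so the example is self-contained). A model that always predicts one class with confidence equal to its accuracy has a calibration error of zero:</p>

```python
import numpy as np

def toy_calibration_error(class_confidences, gt_idxs):
    # same computation as the post's function, restated here
    pred = np.argmax(class_confidences, axis=1)
    acc = np.mean(pred == gt_idxs)
    conf = np.max(class_confidences, axis=1).mean()
    return np.absolute(acc - conf), acc, conf

# a model that always predicts class 0 with confidence 0.7,
# and is right on exactly 7 of 10 samples -> well calibrated
confs = np.tile([0.7, 0.3], (10, 1))
gt = np.array([0]*7 + [1]*3)
ce, acc, conf = toy_calibration_error(confs, gt)
print(np.isclose(ce, 0.0))  # True: accuracy 0.7 matches confidence 0.7
```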
<h2 id="expectedcaliberationerror">Expected Calibration Error</h2>
<p>A classifier is said to be well calibrated if it has a low ECE. I came across Expected Calibration Error in a recent, aptly titled paper “<a href="https://arxiv.org/abs/1912.03263"><em>Your classifier is secretly an energy based model and you should treat it like one</em></a>”. It defines ECE as</p>
<script type="math/tex; mode=display">ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\right|</script>
<p>To compute the ECE of a model, we first bin the predictions by confidence. Then we take the average of the per-bin calibration errors, weighting each bin by the fraction of predictions it contains.</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">,</span> <span class="n">num_bins</span><span class="o">=</span><span class="mi">20</span><span class="p">):</span>
    <span class="n">delta</span> <span class="o">=</span> <span class="mf">1.0</span><span class="o">/</span><span class="n">num_bins</span>
    <span class="n">predicted_confidences</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">predicted_confidences</span><span class="p">)</span>
    <span class="n">data</span><span class="p">,</span> <span class="n">counts</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">,</span><span class="n">delta</span><span class="p">):</span>
        <span class="n">h</span> <span class="o">=</span> <span class="n">l</span><span class="o">+</span><span class="n">delta</span>
        <span class="c1"># bin the predictions</span>
        <span class="n">idxs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argwhere</span><span class="p">((</span><span class="n">predicted_confidences</span><span class="o"><=</span><span class="n">h</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">predicted_confidences</span><span class="o">></span><span class="n">l</span><span class="p">))</span><span class="o">.</span><span class="n">flatten</span><span class="p">()</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">idxs</span><span class="p">)</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span> <span class="k">continue</span>
        <span class="c1"># compute the calibration error of this bin</span>
        <span class="n">ce</span><span class="p">,</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span> <span class="o">=</span> <span class="n">compute_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">[</span><span class="n">idxs</span><span class="p">,:],</span> <span class="n">gt_idxs</span><span class="p">[</span><span class="n">idxs</span><span class="p">])</span>
        <span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">l</span><span class="p">,</span><span class="n">h</span><span class="p">,</span><span class="n">ce</span><span class="p">,</span><span class="n">acc</span><span class="p">,</span><span class="n">conf</span><span class="p">])</span>
        <span class="n">counts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">idxs</span><span class="p">))</span>
    <span class="c1"># average the per-bin errors, weighted by bin size</span>
    <span class="n">ece</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">([</span><span class="n">c</span><span class="o">*</span><span class="n">ce</span> <span class="k">for</span> <span class="n">c</span><span class="p">,(</span><span class="n">_</span><span class="p">,</span><span class="n">_</span><span class="p">,</span><span class="n">ce</span><span class="p">,</span><span class="n">_</span><span class="p">,</span><span class="n">_</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">counts</span><span class="p">,</span><span class="n">data</span><span class="p">)])</span><span class="o">/</span><span class="n">n</span>
    <span class="k">return</span> <span class="n">ece</span><span class="p">,</span> <span class="n">data</span>
</code></pre></div></div>
<p>For testing, I finetuned a model with a resnet18 stem and computed predictions over the <a href="https://github.com/fastai/imagenette">ImageWoof</a> dataset.</p>
<p>Before training, the model has a very high ECE.</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="n">learner</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'stage0'</span><span class="p">);</span>
<span class="n">probs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">get_preds</span><span class="p">()</span>
<span class="n">class_confidences</span> <span class="p">,</span><span class="n">gt_idxs</span> <span class="o">=</span> <span class="n">probs</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">targets</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">ece</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">)</span>
</code></pre></div></div>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_figure</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">delta</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/20200120expectedcaliberationerror_files/output_11_0.png" alt="png" /></p>
<p>After training, the ECE is considerably reduced.</p>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="n">learner</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'stage1'</span><span class="p">);</span>
<span class="n">probs</span><span class="p">,</span> <span class="n">targets</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">get_preds</span><span class="p">()</span>
<span class="n">class_confidences</span><span class="p">,</span> <span class="n">gt_idxs</span> <span class="o">=</span> <span class="n">probs</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span> <span class="n">targets</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">ece</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">expected_caliberation_error</span><span class="p">(</span><span class="n">class_confidences</span><span class="p">,</span><span class="n">gt_idxs</span><span class="p">)</span>
</code></pre></div></div>
<div class="languagepython highlighterrouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_figure</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">delta</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/20200120expectedcaliberationerror_files/output_14_0.png" alt="png" /></p>
<h1>Gradient Through Concatenation</h1>
<p><em>2019-07-23 · http://www.outerproduct.space/2019/07/23/gradientthroughconcatenation</em></p>
<p>Concatenation of vectors is a common operation in the computational graphs of modern deep learning networks. This post describes how to compute the derivative of the output w.r.t. the inputs of a concatenation.</p>
<script type="math/tex; mode=display">z = C(x,y)</script>
<p>where $C$ is the concat operation. We are interested in computing $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$.</p>
<p>Assuming $x\in \mathbb{R}^m$
and $y\in \mathbb{R}^n$,</p>
<p>We can rewrite the concat operation as</p>
<script type="math/tex; mode=display">z = \begin{bmatrix}I_m\\0\end{bmatrix}x+\begin{bmatrix}0\\I_n\end{bmatrix}y</script>
<p>which implies</p>
<script type="math/tex; mode=display">\frac{\partial z}{\partial x} = \begin{bmatrix}I_m\\0\end{bmatrix}</script>
<script type="math/tex; mode=display">\frac{\partial z}{\partial y} = \begin{bmatrix}0\\ I_n\end{bmatrix}</script>
<h1>Gradient Through Addition with Broadcasting</h1>
<p><em>2018-09-18 · http://www.outerproduct.space/2018/09/18/gradientthroughadditionwithbradcasting</em></p>
<p>Calculating the gradient across an addition op is considered a simple algebra trick. But addition of tensors in real applications allows broadcasting. In this post, we examine how to compute the gradient even in such situations. Let us begin with the fundamentals.</p>
<h2 id="gradientofaddition">Gradient of Addition</h2>
<p>Consider a simple sequence of operations. $A$ and $B$ are inputs which ultimately lead to the computation of a scalar loss/error term $l$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
z &= A+B
\\
l &\dashleftarrow z
\end{aligned} %]]></script>
<p>We are interested in the gradients of $l$ w.r.t. both inputs. Let’s just go ahead and apply the chain rule to get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial l}{\partial A} &= \frac{\partial l}{\partial z}
\\
\frac{\partial l}{\partial B} &= \frac{\partial l}{\partial z}
\end{aligned} %]]></script>
<h2 id="additioninneuralnetworks">Addition in Neural Networks.</h2>
<p>Let’s examine how addition in a feed forward layer is computed. This reveals how additions are generally done in real use cases. Consider a simple linear feed forward layer with $x$ as input. The transformation of the layer is given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
z &= W^Tx+b
\\
l &\dashleftarrow z
\end{aligned} %]]></script>
<p>The gradients of the parameters w.r.t. the loss term are</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial l}{\partial b} &=\frac{\partial l}{\partial z}
\\
\frac{\partial l}{\partial W} &= x \frac{\partial l}{\partial z}^T
\end{aligned} %]]></script>
<p>$W^Tx$ is a vector of the same dimensions as $b$ if $x$ is also a vector. But in practice, networks are almost always trained with minibatches of samples at a time, which makes $x$ a 2-dimensional tensor, and suddenly the quantities $W^Tx$ and $b$ have different dimensions.</p>
<h2 id="additionwithbroadcasting">Addition with broadcasting</h2>
<p>An implicit assumption we make while adding two multi dimensional quantities is that their dimensions always match. But numerical frameworks allow addition even when the dimensions of the operands are not the same. This is called addition with broadcasting.</p>
<p>Addition is allowed if the arrays are broadcast compatible with each other. <a href="https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html">Numpy’s docs</a> describe two arrays as broadcast compatible if their dimensions are compatible. Two dimensions are compatible if they are</p>
<ol>
<li>Equal</li>
<li>one of them is 1</li>
</ol>
<p>When the arrays do not have the same number of dimensions, they are still compatible if the smaller array’s shape can be padded with dimensions of size 1 on the left (the leading side) until both arrays have the same number of dimensions, at which point the dimensions of both arrays must be compatible.</p>
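<p>These rules can be sketched directly in numpy (the shapes below are made up for illustration):</p>

```python
import numpy as np

A = np.ones((2, 3, 4))
B = np.ones((3, 1))   # fewer dimensions: treated as shape (1, 3, 1)
# trailing dimensions are compared pairwise: 4 vs 1 (one is 1: ok),
# 3 vs 3 (equal: ok), 2 vs the padded 1 (one is 1: ok)
C = A + B
print(C.shape)        # (2, 3, 4)
```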
<h2 id="gradientthroughbroadcasting">Gradient through Broadcasting</h2>
<p>Let’s unwrap what actually happens during a broadcast operation. For simplicity, let’s say we are trying to add two tensors $A$ and $B$. $A$ and $B$ agree on all dimensions except the last, where $A$ has a dimension of size $3$ while $B$ lacks that dimension.</p>
<p>In this case, $B$ would be broadcast over $A$ to facilitate the addition as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A+B &=\begin{bmatrix} A_{:,0} & A_{:,1} & A_{:,2} \end{bmatrix}+B
\\&=
\begin{bmatrix} A_{:,0} & A_{:,1} & A_{:,2} \end{bmatrix}+\begin{bmatrix} B & B & B \end{bmatrix}
\\&=
\begin{bmatrix} A_{:,0}+B & A_{:,1}+B & A_{:,2}+B \end{bmatrix}
\end{aligned} %]]></script>
<p>This allows us to visualize the computations better. For the following computation,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
z &= A+B
\\
l &\dashleftarrow z
\end{aligned} %]]></script>
<p>we know that $\frac{\partial l}{\partial A}$ remains the same. But $\frac{\partial l}{\partial B}$ is for the tensor that has been broadcast and hence needs to be adjusted. Since $B$ contributes to every slice it was broadcast into, its gradient accumulates the upstream gradient over the broadcast dimension. Hence</p>
<script type="math/tex; mode=display">\frac{\partial l}{\partial B} = \sum_{i} \frac{\partial l}{\partial z_{:,i}}</script>
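<p>This sum-reduction can be checked numerically. Below is a sketch (shapes and values are made up; $l$ is an arbitrary linear functional of $z$, so the finite-difference check is exact up to rounding):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 1))   # broadcast across the 3 columns of A

w = rng.normal(size=(4, 3))   # fixed weights defining l = sum((A+B)*w)
dl_dz = w                     # gradient of l w.r.t. z = A + B

# backward through broadcasting: sum upstream gradient over broadcast axis
dl_dB = dl_dz.sum(axis=1, keepdims=True)

# finite-difference check on one entry of B
eps = 1e-6
B_pert = B.copy()
B_pert[2, 0] += eps
l0 = ((A + B) * w).sum()
l1 = ((A + B_pert) * w).sum()
print(np.isclose((l1 - l0) / eps, dl_dB[2, 0], atol=1e-4))  # True
```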
<p>A curious case to note here: the gradient of parameter $B$ has a dimension of $A$ in it and hence depends on the size of $A$. This might be one reason for the differing behaviour of neural networks when they are trained with different batch sizes.</p>
<h1>Softmax Classifier</h1>
<p><em>2018-03-28 · http://www.outerproduct.space/2018/03/28/softmax_classifier</em></p>
<p>Imagine we have a dataset $\{x,y\}_{i=0}^m$ where $x$ is a data point
and $y$ indicates the class $x$ belongs to. For deriving the LDA classifier,
we modelled the class conditional density $P(x|y)$ as a Gaussian and
derived the posterior probabilities $P(y|x)$. Here, we will directly
model the posterior with a linear function. Since the posterior directly
models which class a data point belongs to, there is not much left to do
afterwards to get a classifier.</p>
<p>But modelling $P(y|x)$ with only a linear projection $w^Tx$ has some
problems. There is no easy way to restrict $w^Tx$ to always fall in
$[0,1]$, nor to ensure that $\sum_k P(y=k|x) = 1$.</p>
<p>We want a projection of the data such that it forms a proper probability
distribution over the classes. Since this is near impossible with only a
linear function, we stack a parameter-free nonlinear transformation after the
linear transformation.</p>
<p><strong>Softmax</strong> is a vector valued function defined over a sequence $(z_k)$
as</p>
<script type="math/tex; mode=display">\begin{aligned}
\operatorname{softmax}(z)_k = \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]}\end{aligned}</script>
<p>Softmax preserves the relative magnitude of its input, i.e. the larger
input coordinate gets the larger output value. Softmax also squashes the
values to lie in the range $[0,1]$ and makes their sum equal to $1$.</p>
<script type="math/tex; mode=display">\begin{aligned}
\sum_k \frac{\operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = \frac{\sum_k \operatorname{exp}[z_k]}{\sum_j\operatorname{exp}[z_j]} = 1\end{aligned}</script>
<p>So our classifier is</p>
<script type="math/tex; mode=display">\begin{aligned}
P(y|x) = \operatorname{softmax}(w^Tx)\end{aligned}</script>
<h1 id="derivativeofthesoftmaxfunction">Derivative of the Softmax function</h1>
<p>We will need the derivative of softmax function later on. So let’s
figure out what it is. We can begin by writing softmax in a concise
form.</p>
<script type="math/tex; mode=display">\begin{aligned}
s_k = \frac{e_k}{\Sigma}\end{aligned}</script>
<p>where
$e_k = \operatorname{exp}[z_k]$ and
$\Sigma = \sum_j\operatorname{exp}[z_j]$. With
$\frac{\partial e_k}{\partial z_k} = e_k$ and
$\frac{\partial \Sigma}{\partial z_p} = e_p$, we can easily derive the
derivative for softmax function as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ when $p \neq k$}
\\
\frac{\partial s_k}{\partial z_p} &= e_k\left[ -\frac{1}{\Sigma^2} e_p\right] \\&= -s_ks_p
\\
\text{ when $p = k$}
\\
\frac{\partial s_k}{\partial z_p} &=
\frac{ e_k \Sigma - e_p e_k}{\Sigma^2} \\&= s_k - s_ks_p
\\
\text{in general}
\\
\frac{\partial s_k}{\partial z_p} &= s_k(\delta_{kp} - s_p)\end{aligned} %]]></script>
<p>$\delta_{kp}$ is the Kronecker delta, which is $1$ only when $k=p$ and
$0$ otherwise.</p>
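<p>This derivative can be sanity-checked numerically. Below is a sketch (not from the original post; the logits are made up) comparing the analytic Jacobian $s_k(\delta_{kp} - s_p)$ against finite differences:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
s = softmax(z)
# analytic Jacobian: J[k, p] = s_k * (delta_kp - s_p)
J = np.diag(s) - np.outer(s, s)

# finite-difference estimate of the same Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for p in range(3):
    zp = z.copy()
    zp[p] += eps
    J_num[:, p] = (softmax(zp) - s) / eps

print(np.allclose(J, J_num, atol=1e-4))  # True
```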
<h1 id="estimatingmodelparameterusinglikelihood">Estimating Model Parameter using Likelihood</h1>
<p>Now that we have a complete model of the classifier,
$P(y|x) = \operatorname{softmax}(w^Tx)$, all that remains is to
estimate the model’s parameter $w$ from the dataset. We can begin by
using the
<a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">likelihood</a>
of the model explaining the training data.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L(w) &= \prod_x \prod_k P(k|x;w)^{y_k}
\\
&= \prod_x \prod_k \operatorname{softmax}(w^Tx)_k^{\,y_k} \end{aligned} %]]></script>
<p>The likelihood gives a measure of how well the model explains the data
$\{x,y\}$ for a given parameter $w$. To get the optimum value of the
parameter, all we have to do is find the value of $w$ which maximises the
likelihood.</p>
<p>The likelihood function is a bit difficult to work with on its own. But
we can take the negative of the log of the likelihood function<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> so that all
products get converted to sums and all powers get converted to
products. <script type="math/tex">% <![CDATA[
\begin{aligned}
E(w) &= -\log L(w)
\\
&= -\sum_x \sum_k y_k \log s_k\end{aligned} %]]></script></p>
<p>All we have to do to get the best parameter is to minimise the negative
log likelihood (thereby maximising likelihood)</p>
<script type="math/tex; mode=display">\begin{aligned}
w_{opt} = \operatorname{argmin}_w E(w)\end{aligned}</script>
<p><em>Note: If we think of $y$ as a probability distribution of data
belonging to each of the classes and $s$ as our model’s prediction of
the same, then $E(w)$ is cross entropy between the two distributions.</em></p>
<p>To compute $w_{opt}$, we simply have to find the derivative of $E(w)$
w.r.t. $w$ and equate it to 0. But finding the derivative over the entire
$w$ at once is difficult and non-intuitive. So let’s break it down and find
derivatives over each of the columns $w_p$ separately.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla&_{w_p} E(w)
\\
&= -\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial w_p}
\\
&= -\sum_x \sum_k y_k \frac{1}{s_k} \frac{\partial s_k}{\partial z_p} \frac{\partial z_p}{\partial w_p}
\\
&= -\sum_x \sum_k y_k \frac{1}{s_k} s_k(\delta_{kp} - s_p) x
\\
&= -\sum_x \sum_k y_k (\delta_{kp} - s_p) x\end{aligned} %]]></script>
<p>$\sum_k y_k (\delta_{kp} - s_p)$ can be expanded as
$\sum_k y_k \delta_{kp} - s_p\sum_k y_k$. Only $\delta_{pp}=1$, and
$\sum_k y_k = 1$. Thus the original term evaluates to $y_p - s_p$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla_{w_p} E(w) &= \sum_x (s_p - y_p) x
\\
\nabla_{w} E(w) &= \sum_x x(s - y)^T\end{aligned} %]]></script>
<p>Now we have a problem here. Setting $\nabla_w E(w)=0$ does not give any
information about $w$. However, what the gradient does tell us is, in the
space of $w$, which direction to move (change $w$) in so that the
change in $E$ is the largest. So if we move in the exact opposite
direction (the negative of the gradient), we get the maximum reduction in $E$.
Since $E$ measures the difference between the model’s prediction and the
true labels, decreasing $E$ means our model is getting better.
Enter <strong>Gradient Descent</strong>.</p>
<h1 id="gradientdescentalgorithm">Gradient Descent Algorithm</h1>
<p>Gradient descent is exactly that: descend along the gradient. If you want to
minimise a function, keep moving in the negative direction of its gradient
(thereby descending).</p>
<p><a href="https://en.wikipedia.org/wiki/Gradient_descent">Gradient descent</a> is an
optimisation algorithm for cases where there is no analytic or easy
solution for the parameters, but gradients of the models can be computed
at each point. The algorithm simply says, if $L$ is some loss function
which measures how good the model is with parameter $\theta$, then we
can update $\theta$ so as to make the model better. <script type="math/tex">\begin{aligned}
\theta_{new} = \theta - \alpha \nabla_\theta L\end{aligned}</script></p>
<p>Now repeat the same with the new $\theta_{new}$ and the model will keep
getting better. There are some caveats, however. The behaviour of
gradient descent depends entirely on the choice of step size $\alpha$, and
even then, convergence to the global optimum is not guaranteed.</p>
<p>In practice, however, gradient descent performs well. We do have some
tricks to pick the (seemingly) best step size and some other ways to
ensure the model improves. In our case, the update step is simply</p>
<script type="math/tex; mode=display">\begin{aligned}
w_{new} = w - \alpha \nabla_{w}E(w)\end{aligned}</script>
<p>To compute the true gradient direction of the model, we would need to evaluate
the model over every possible data point. This is impossible because of (1) the
intractable amount of computation and (2) the fact that we don’t have labels for all
possible data (if we did, what would be the point of building a classifier?).</p>
<p>So, we compute an approximate model gradient over the data that we
have. But even this is a lot of work for an approximate gradient,
and that is not the end of the story: we have to keep iterating. Since we
are approximating anyway, why not approximate further? Pick a sample at random,
compute the gradient over it, and update the model. This is Stochastic
Gradient Descent.</p>
<p><a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">Stochastic Gradient
Descent</a> is
not that stable. Gradients of individual samples do not agree with each other,
and hence the frequent updates simply result in random motion in the
parameter space. To make it more stable, we compute the gradient over a small
set of samples, or a batch. This is batched stochastic gradient descent.
Batched SGD updates the model more frequently than pure gradient descent
and is more stable than vanilla SGD. In fact, batched SGD is so commonly
used that it has become the vanilla: SGD now refers to batched SGD.</p>
<p>This is how you implement SGD.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&W \gets initialise\_weights() \\
&\text{repeat:} \\
&\quad x_b,y_b \gets next(get\_batch)\\
&\quad g_b \gets \nabla_W E(W; x_b, y_b) \\
&\quad W \gets W - \alpha g_b
\end{aligned} %]]></script>
<h1 id="implementingsoftmaxclassifieranditsgradients">Implementing Softmax classifier and its gradients</h1>
<p>Implementing the forward prediction of the classifier is pretty
straightforward. First we do a matrix-vector multiplication to implement
$z=w^Tx$ and then pointwise-exponentiate all the terms in $z$.</p>
<p>However, we have to sum all of these exponentiated terms to get the
denominator in the softmax step. Since exponentiation creates huge
numbers when components of $z$ are much greater than $1$, this creates
numerical errors (overflow).</p>
<p><img src="figures/exponentialcurve" alt="image" /></p>
<p>To get rid of the numerical errors we use the following trick.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\operatorname{softmax}(z)_k &= \frac{e^{z_k}}{\sum_i e^{z_i}}
\\
&= \frac{e^{M}e^{z_k}}{e^{M}\sum_i e^{z_i}}
\\
&= \frac{e^{z_kM}}{\sum_i e^{z_iM}}\end{aligned} %]]></script>
<p>If we choose $M$ large enough such that all the terms in the powers are
negative or $0$, all the exponentiated terms will be small. So we set
$M = \max z_i$.</p>
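<p>A quick demonstration of why this trick matters (a sketch with made-up logits; the naive version overflows to <code>inf</code> and yields <code>nan</code>, while the shifted version is exact):</p>

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()          # exp(1000) -> inf, inf/inf -> nan

shifted = np.exp(z - z.max())                    # powers are now -2, -1, 0
stable = shifted / shifted.sum()

print(np.isnan(naive).any())                     # True
print(np.isclose(stable.sum(), 1.0))             # True
```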
<p>The following code sample shows how the model’s prediction is implemented.
The code has been vectorised so that it can predict for a batch of $x$
at once.</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>def get_predictions(x,W):
    z = np.matmul(x,W.T)
    # normalisation trick so
    # that largest of z is 0
    M = np.max(
        z,
        axis=1,
        keepdims=True
    )
    e = np.exp(z-M)
    sigma = e.sum(
        axis=1,
        keepdims=True
    )
    s = e/sigma
    return s
</code></pre></div></div>
<p>Unlike the forward pass, implementing the gradient is very simple. It is only an
outer product between the two vectors $s-y$ and $x$. But when we implement
it for a batch of samples and its predictions, the outer products can be
implemented as a matrix multiplication. See the code sample below.</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>def get_batch_gradient(
    x_b, # input
    y_b, # target
    s_b  # prediction
):
    batch_size = x_b.shape[0]
    g_b = np.matmul(
        (s_b-y_b).T,
        x_b
    )
    return g_b/batch_size
</code></pre></div></div>
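<p>To tie the two functions together, here is a hypothetical end-to-end run of the batched SGD loop sketched earlier. The synthetic two-blob data, learning rate, batch size, and step count are all made up for illustration, and the two helpers are restated so the block is self-contained:</p>

```python
import numpy as np

def get_predictions(x, W):
    z = np.matmul(x, W.T)
    e = np.exp(z - z.max(axis=1, keepdims=True))  # max trick
    return e / e.sum(axis=1, keepdims=True)

def get_batch_gradient(x_b, y_b, s_b):
    return np.matmul((s_b - y_b).T, x_b) / len(x_b)

# tiny synthetic problem: two well-separated gaussian blobs, two classes
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 1, size=(50, 3)),
               rng.normal( 2, 1, size=(50, 3))])
y = np.zeros((100, 2))
y[:50, 0] = 1
y[50:, 1] = 1

W = np.zeros((2, 3))
alpha = 0.1
for step in range(200):
    idx = rng.integers(0, 100, size=16)   # sample a random batch
    s_b = get_predictions(x[idx], W)
    W = W - alpha * get_batch_gradient(x[idx], y[idx], s_b)

accuracy = (get_predictions(x, W).argmax(axis=1) == y.argmax(axis=1)).mean()
print(accuracy)
```

The blobs are separable through the origin, so even this short run should reach high accuracy.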
<h1 id="modelperformance">Model Performance</h1>
<p><a href="https://www.cs.toronto.edu/~kriz/cifar.html">Cifar 10</a> is an image
dataset with 10 image classes. In this section, we test our simple
softmax classifier’s performance on it.</p>
<p>To tune the model to get optimum performance, we first need to find the
best hyperparameters (batch size and learning rate). For this, we first
split the training set into two: a smaller set for validation and the
rest to be used solely for training. The model is trained only on this
new training set, and the validation set acts as a proxy test set
while we find the best hyperparameters.</p>
<p>So we train the model for a relatively short duration (10,000 gradient
updates) and see its performance on the validation set for different choices
of hyperparameters. The following table lists model performance for
different hyperparameter combinations.</p>
<table>
  <thead>
    <tr><th>learning rate \ batch size</th><th>4</th><th>8</th><th>16</th><th>32</th><th>64</th><th>128</th><th>256</th><th>512</th></tr>
  </thead>
  <tbody>
    <tr><td>$10^{-1}$</td><td>0.20</td><td>0.16</td><td>0.28</td><td>0.31</td><td>0.26</td><td>0.29</td><td>0.22</td><td>0.28</td></tr>
    <tr><td>$10^{-2}$</td><td>0.22</td><td>0.29</td><td>0.35</td><td>0.38</td><td>0.38</td><td>0.39</td><td>0.40</td><td>0.40</td></tr>
    <tr><td>$10^{-3}$</td><td>0.33</td><td>0.34</td><td>0.36</td><td>0.36</td><td>0.37</td><td>0.37</td><td>0.37</td><td>0.37</td></tr>
    <tr><td>$10^{-4}$</td><td>0.29</td><td>0.30</td><td>0.30</td><td>0.30</td><td>0.31</td><td>0.30</td><td>0.30</td><td>0.30</td></tr>
    <tr><td>$10^{-5}$</td><td>0.18</td><td>0.19</td><td>0.18</td><td>0.19</td><td>0.19</td><td>0.19</td><td>0.19</td><td>0.17</td></tr>
  </tbody>
</table>
<p>Batch size $256$ and learning rate $0.01$ gave the best performance. So
we train the model for a longer duration (1,000,000 gradient updates)
with these parameters. The following plot shows the validation loss as
training progresses.</p>
<p><img src="figures/validationloss" alt="image" /></p>
<p>We get a final test accuracy of 38.05%.</p>
<h1 id="conclusion">Conclusion</h1>
<p>The LDA model gave us 37.85% accuracy on the Cifar 10 dataset. The softmax
classifier gives us 38% accuracy. It appears to be a close tie
between the two models, but one important distinction is that LDA
explicitly modelled the data as Gaussian, while we made no such
assumption while designing the softmax classifier.</p>
<p>Our simple linear classifier appears useless when compared to the bigger and
more complex models (CNNs) that achieve near-perfect accuracy on Cifar 10.
But there is some value in learning these simple ones first. They do
teach some very valuable lessons about data modelling. They are also
very good for testing implementations of optimiser algorithms like the SGD we
implemented for this post. Do test them out on some other problems.</p>
<h1 id="code">Code</h1>
<p>The code is
<a href="https://github.com/nithishdivakar/blogpostcodes/tree/master/softmaxclassifiers">here</a>.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>$\log$ is a strictly increasing function <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1>Differentiable Computations</h1>
<p><em>2018-02-18 · http://www.outerproduct.space/2018/02/18/differentiablecomputations</em></p>
<p>Auto gradient is a nice feature found in many computational frameworks.
Specify the computation in forward direction and the framework computes
backward gradients. Let’s talk about the generic method to do this.</p>
<p>Let’s say we have to compute the result of ‘something’. It may be a
nasty heat equation or some logic driven steps to get from input to
output. Abstracting the steps involved gives us a sequence of equations
<script type="math/tex">\begin{aligned}
z_i = f_i(z_{a(i)})\end{aligned}</script></p>
<p>The $z$’s are intermediate variables of the computation steps or they may be parameters. The selections $z_{a(i)}$ are inputs to $f_i$.</p>
<p><em>What does gradient of this sequence of computation mean?</em></p>
<p>If $z_n = f_n(z_{a(n)})$ is the final step of the computation, then
computing gradients of the sequence means computing
$\frac{\partial z_n}{\partial z_i}$ for every $i$. Computing all those
gradients gives us how the output changes w.r.to the parameters.</p>
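<p>Concretely, such a sequence can be represented as a list of steps, each carrying a function $f_i$ and the index list $a(i)$ of its inputs. The following is a small sketch of my own, not any particular framework's representation:</p>

```python
import math

# z_i = f_i(z_{a(i)}): each step stores f_i and its input indices a(i)
steps = [
    (lambda u: u[0] * u[0],    [0]),     # z1 = z0 ** 2
    (lambda u: math.sin(u[0]), [1]),     # z2 = sin(z1)
    (lambda u: u[0] + u[1],    [1, 2]),  # z3 = z1 + z2
]

def forward(x0, steps):
    z = [x0]                             # z0 is the input
    for f, a in steps:
        z.append(f([z[j] for j in a]))
    return z

z = forward(2.0, steps)                  # z[3] = x0**2 + sin(x0**2)
```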
<h1 id="handlingbranchesandloops">Handling Branches and loops</h1>
<p>For any general computation to be included, we need to talk about
branches and loops. How are these handled in our model?</p>
<p>Conditional branches can be represented by indicator functions. See
<a href="https://en.wikipedia.org/wiki/Indicator_function#Derivatives_of_the_indicator_function">this
entry</a>
for details on computing derivative of indicator functions.</p>
<p>Loops can be unrolled into a sequence of functions. All of them
share the same parameters, but the input of each is the output of the
function representing the previous iteration. For example</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>begin loop 1:3
x = x + c
end
</code></pre></div></div>
<p>can be unrolled as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_1 &= x + c
\\x_2 &= x_1 + c
\\x_3 &= x_2 + c\end{aligned} %]]></script>
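<p>The equivalence is easy to check in code; the loop and its hand-unrolled version compute the same value:</p>

```python
c = 0.5
x = 1.0
for _ in range(3):   # begin loop 1:3
    x = x + c        #     x = x + c
                     # end

# unrolled: three functions sharing the same parameter c,
# each consuming the previous iteration's output
x1 = 1.0 + c
x2 = x1 + c
x3 = x2 + c

# x == x3 == 2.5
```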
<p>This won’t work for infinite loops because the unrolling will never end.
But infinite loops have no business in real-world computation. If a loop
cannot be unrolled even after applying the “reality of the universe”, we
are not talking about a computational system. It might be an event loop
or a queue. Neither needs gradients!</p>
<h1 id="forwardcomputationasaconstrainedoptimisationproblem">Forward computation as a Constrained optimisation problem</h1>
<p>Without loss of generality, we can say that all this hoopla of computing
gradients is to minimise the final value. Even if this is not the case,
for example if maximising the final result was the intent, we can
append a negating function at the end of the sequence. There are many
other techniques out there for converting different problems to a
minimisation problem.</p>
<p>Now that we have <em>that</em> out of the way, lets look at the following
problem.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\min~z_n
\\
\text{s.t.}~~z_i &= f_i(z_{a(i)})\end{aligned} %]]></script>
<p>The formulation is a little bit weird. All it is saying is: minimise
$z_n$ such that outputs of computations ($f_i$) are inputs to some
other computations (all $f$’s which have $z_i$ as input). The constraints
maintain the integrity of the sequence. So we have managed to represent
the same computation in two ways. Great!</p>
<h1 id="howdoyousolveaconstrainedoptimisationproblem">How do you solve a constrained optimisation problem?</h1>
<p>Using the method of <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange
multipliers</a>. It
basically says that once we define Lagrange’s function</p>
<script type="math/tex; mode=display">\begin{aligned}
L(z,\lambda) = z_n - \sum_i\lambda_i(z_i - f_i(z_{a(i)}))\end{aligned}</script>
<p>the gradient of $L$ w.r.to its parameters vanishes at the optimum points
of the original problem as well. So we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nabla_{\lambda_i}=0 &\implies z_i = f_i(z_{a(i)})
\\
\nabla_{z_n}=0 &\implies \lambda_n = 1
\\
\nabla_{z_i}=0 &\implies \lambda_i = \sum_{k\in b(i)}\lambda_k \frac{\partial f_k}{\partial z_i}\end{aligned} %]]></script>
<p>The final expression for the $\lambda_i$’s gives
$\frac{\partial z_n}{\partial z_i}$ and hence all the gradients of our
original computation. $b(\cdot)$ is like an inverse of $a(\cdot)$: $a(i)$
gives which $z$’s are arguments of $f_i$, while $b(i)$ gives which
$f$’s have $z_i$ as an argument ($b = a^{-1}$, loosely). Anyway, these
equations fit nicely into a linear system</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A\lambda =
\begin{bmatrix}
0\\
\vdots\\
0\\
1
\end{bmatrix}
\quad
; A_{i,k} =
\begin{cases}
-\frac{\partial f_k}{\partial z_i} & k\in b(i)
\\ 1 & k=i
\\ 0 & otherwise
\end{cases}\end{aligned} %]]></script>
<p>$A$ is an upper triangular matrix with 1’s on the diagonal. Were it not,
we would be looking at a sequence of computation in which some step needs a
result from the future. That is too complicated for now (an implicit system
rather than an explicit one).</p>
<p>This linear system of equations opens up a myriad of possibilities for
computing gradients faster. The simplest of these is back substitution,
since $A$ is triangular. If the computation we are dealing with is the
forward pass of a neural network, what we get out of the back
substitution is the “backprop” algorithm!!</p>
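<p>Here is a small numeric sketch of this idea on a toy scalar computation of my own (not the network from the post): build $A$ from the partial derivatives and solve the triangular system for the $\lambda$'s:</p>

```python
import numpy as np

# toy computation: z1 = x, z2 = z1*z1, z3 = z2 + z1; we want dz3/dz1
x = 3.0
z1 = x
z2 = z1 * z1        # f2
z3 = z2 + z1        # f3, the final value

# A[i,k] = -df_k/dz_i for k in b(i), 1 on the diagonal, 0 elsewhere
A = np.array([
    [1.0, -2 * z1, -1.0],   # z1 feeds f2 (df2/dz1 = 2*z1) and f3 (df3/dz1 = 1)
    [0.0,  1.0,    -1.0],   # z2 feeds f3 (df3/dz2 = 1)
    [0.0,  0.0,     1.0],
])
rhs = np.array([0.0, 0.0, 1.0])

lam = np.linalg.solve(A, rhs)   # back substitution: A is upper triangular

# analytically z3 = x**2 + x, so dz3/dx = 2*x + 1 = 7, and lam[0] matches
```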
<h1 id="derivingbackpropinaweirdway">Deriving backprop, in a weird way</h1>
<p>Let’s look at a very simple neural network</p>
<p><script type="math/tex">% <![CDATA[
\begin{aligned}
a_1 &= \sigma(x w_1)
\\
a_2 &= \operatorname{softmax}(a_1 w_2)
\\
l &= \operatorname{loss}(a_2,y)\end{aligned} %]]></script> If we split (ahem!) it
up according to our problem, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
z_1&=x,~ z_2=y, z_3=w_1, z_4=w_2
\\z_5 &= z_1z_3
\\z_6 &= \sigma(z_5)
\\z_7 &= z_6z_4
\\z_8 &= \operatorname{softmax}(z_7)
\\z_9 &= \operatorname{loss}(z_8,z_2)\end{aligned} %]]></script>
<p>This gives us the linear system</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{bmatrix}
1 & & & & -\frac{\partial f_{5}}{\partial z_{1}} & & & &
\\ &1 & & & & & & & -\frac{\partial f_{9}}{\partial z_{2}}
\\ & &1 & & -\frac{\partial f_{5}}{\partial z_{3}} & & & &
\\ & & &1 & & & -\frac{\partial f_{7}}{\partial z_{4}} & &
\\ & & & &1 & -\frac{\partial f_{6}}{\partial z_{5}} & & &
\\ & & & & &1 & -\frac{\partial f_{7}}{\partial z_{6}} & &
\\ & & & & & &1 & -\frac{\partial f_{8}}{\partial z_{7}} &
\\ & & & & & & &1 & -\frac{\partial f_{9}}{\partial z_{8}}
\\ & & & & & & & & 1
\end{bmatrix}
\begin{bmatrix}
\lambda_{1}\\
\lambda_{2}\\
\lambda_{3}\\
\lambda_{4}\\
\lambda_{5}\\
\lambda_{6}\\
\lambda_{7}\\
\lambda_{8}\\
\lambda_{9}\\
\end{bmatrix}
=
\begin{bmatrix}
0\\
0\\
0\\
0\\
0\\
0\\
0\\
0\\
1
\end{bmatrix}\end{aligned} %]]></script>
<p>Apply back substitution and we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda_3 &= \lambda_5 \frac{\partial f_5}{\partial z_3}\quad
\lambda_4 = \lambda_7 \frac{\partial f_7}{\partial z_4}\\
\lambda_5 &= \lambda_6 \frac{\partial f_6}{\partial z_5}\quad
\lambda_6 = \lambda_7 \frac{\partial f_7}{\partial z_6}\\
\lambda_7 &= \lambda_8 \frac{\partial f_8}{\partial z_7}\quad
\lambda_8 = \lambda_9 \frac{\partial f_9}{\partial z_8}\\
\lambda_3 &= \frac{\partial l}{\partial z_8} \frac{\partial z_8}{\partial z_7} \frac{\partial z_7}{\partial z_6} \frac{\partial z_6}{\partial z_5} \frac{\partial z_5}{\partial w_1}\\
\lambda_4 &= \frac{\partial l}{\partial z_8} \frac{\partial z_8}{\partial z_7} \frac{\partial z_7}{\partial w_2}\end{aligned} %]]></script>
<p>and there it is!! $\lambda_3$ is the gradient for parameter $w_1$ and $\lambda_4$ is the gradient for $w_2$.</p>
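<p>As a quick numeric sanity check of these chain-rule products (a scalar toy of my own, with sigmoids and a squared-error loss standing in for the softmax and $\operatorname{loss}$ above), the analytic gradients match finite differences:</p>

```python
import math

sigma = lambda t: 1.0 / (1.0 + math.exp(-t))

def network_loss(w1, w2, x=0.7, y=1.0):
    a1 = sigma(x * w1)
    a2 = sigma(a1 * w2)      # sigmoid in place of softmax (scalar toy)
    return (a2 - y) ** 2

w1, w2 = 0.3, -0.5
x, y = 0.7, 1.0

# forward pass, keeping intermediates
a1 = sigma(x * w1)
a2 = sigma(a1 * w2)

# chain-rule products, mirroring lambda_3 and lambda_4
dl_da2 = 2 * (a2 - y)
da2_dt = a2 * (1 - a2)                           # sigmoid derivative
lam4 = dl_da2 * da2_dt * a1                      # d loss / d w2
lam3 = dl_da2 * da2_dt * w2 * a1 * (1 - a1) * x  # d loss / d w1

# central finite differences for comparison
eps = 1e-6
fd3 = (network_loss(w1 + eps, w2) - network_loss(w1 - eps, w2)) / (2 * eps)
fd4 = (network_loss(w1, w2 + eps) - network_loss(w1, w2 - eps)) / (2 * eps)
```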
<p>Now the structure of matrix $A$ for this problem isn’t that interesting.
The example network is very simple. Almost too simple; its computational
graph is almost a line graph. But in more interesting cases, like the
<a href="https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf">inception</a>
architecture, the matrix will have a very nice structure. A particular
example is the dense block from
<a href="https://arxiv.org/abs/1608.06993">DenseNet</a>, whose matrix has a
fully filled upper triangle.</p>
<p><strong>Attribution</strong> I had my first encounter with the constrained
optimisation view of computation in Yann LeCun’s <a href="http://yann.lecun.com/exdb/publis/pdf/lecun88.pdf">1988
paper</a> “A
Theoretical Framework for back propagation”. Incidentally, this is the
first paper I understood in deep learning and related fields. Do give
it a read.</p>

<hr />
<p><strong>Linear Classifiers</strong> · 2018-02-18 · <a href="http://www.outerproduct.space/2018/02/18/linear_classifiers">http://www.outerproduct.space/2018/02/18/linear_classifiers</a></p>

<p>In the <a href="/2018/02/11/bayeserror">post on bayes error</a>, we discussed what
is the best classifier if the features are not enough to tell the class
apart. We also derived that in such situation, the best classifier is</p>
<script type="math/tex; mode=display">\begin{aligned}
h(x) = \operatorname{sign} \left( n(x) - \frac{1}{2} \right) \end{aligned}</script>
<p>This
formulation cannot be used in general situations as there is no easy way
to estimate $n(x) = P(y=+1\mid x)$ for a general distribution. But what if
$x$ has a simple distribution?</p>
<p>Let’s assume that the data is Gaussian for each class</p>
<script type="math/tex; mode=display">\begin{aligned}
P(x\mid y) = N(\mu_y,\Sigma_y) = f_y\end{aligned}</script>
<p>The parametric form of
$P(x\mid y)$ immediately gives us a closed form for $n(x)$ by a simple
application of Bayes rule</p>
<script type="math/tex; mode=display">\begin{aligned}
n(x) = \frac{pf_{+1}}{p f_{+1}+(1-p)f_{-1}}\end{aligned}</script>
<p>which in
turn gives us a simple classifier</p>
<script type="math/tex; mode=display">\begin{aligned}
n(x) - \frac{1}{2} = \frac{pf_{+1}}{pf_{+1}+(1-p)f_{-1}} - \frac{1}{2} \end{aligned}</script>
<p>which has the same sign as</p>
<script type="math/tex; mode=display">\begin{aligned}
\frac{f_{+1}}{f_{-1}} - \frac{1-p}{p} \end{aligned}</script>
<p>To further simplify, we use the strictly increasing property of the
$\log$ function and write</p>
<script type="math/tex; mode=display">\begin{aligned}
h(x) = \operatorname{sign} \left( \log \frac{f_{+1}}{f_{-1}} - \log\frac{1-p}{p} \right)\end{aligned}</script>
<p>This gives us a simpler form of the classifier</p>
<script type="math/tex; mode=display">\begin{aligned}
h(x) = \operatorname{sign}(x^TAx + b^Tx+c)\end{aligned}</script>
<p>where
$A = \Sigma_{-1}^{-1}-\Sigma_{+1}^{-1}$.</p>
<p>If we further assume that the class covariances are the same
($\Sigma_{-1}=\Sigma_{+1}$), then what we get is a linear classifier:
<script type="math/tex">\begin{aligned}
h(x) = \operatorname{sign}(b^Tx+c)\end{aligned}</script></p>
<h1 id="ldaformulticlassclassification">LDA for multiclass classification</h1>
<p><a href="/2018/02/11/bayeserror">Bayes error</a> tells us that in the general case, the
classifier with the least error is</p>
<script type="math/tex; mode=display">\begin{aligned}
h(x) = \operatorname{arg max}_y n_y(x) \end{aligned}</script>
<p>where
$n_y(x) = P(y\mid x)$ are the classwise posterior densities. In general,
the data need not follow any distribution and hence the classwise
densities need not have a closed form. To mitigate this, we first assume
that the data does follow a parametric distribution.</p>
<p>We will assume that the class conditional densities $p(x\mid y)$ are
Gaussian distributed. This means that each class of the data is centred
around some point in the data space (the classwise mean), and the density
of the data belonging to that class decreases as we move further away from
this mean point. We will further assume that all these classwise
distributions have the same covariance. Although this assumption is
restrictive, it helps keep our classifier simple. Deriving a
variant of the classifier which accommodates different covariances is
fairly straightforward from the following steps. Thus, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(x\mid y) & = f_y(x) \sim N(\mu_y, \Sigma)
\\
\mu_y &= \frac{1}{m_y} \sum_{x\in class[y]}x
\\
\Sigma &= \frac{1}{m} \sum_{(x,y)} (x-\mu_y)(x-\mu_y)^T\end{aligned} %]]></script>
<p>With these analytic forms, it is easy to get a closed form for $n_y$ by
a straightforward application of Bayes rule with $p_y=p(y)$, and we
get</p>
<script type="math/tex; mode=display">\begin{aligned}
n_y(x) = \frac{p_y f_y(x)}{\sum_{k}p_k f_k(x)}\end{aligned}</script>
<p>From here, getting our classifier is only a matter of simplifying the
equations. So we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
h(x) &= \operatorname{arg max} p_yf_y(x)
\\
&= \operatorname{arg max} w_y^Tx + b_y\end{aligned} %]]></script>
<p>where $w_y = \Sigma^{-1}\mu_y$ and
$b_y = \log p_y - \frac{1}{2} \mu_y^{T}\Sigma^{-1}\mu_y$. See <strong>Appendix
1</strong> for the full derivation.</p>
<h1 id="howtobuildone">How to build one</h1>
<p>To make the classifier, we first need to estimate the classwise means
and the common covariance from the training data. Computing the classwise
means can be done simply by</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>mu[y] = np.mean(X[Y==y],axis=0)
</code></pre></div></div>
<p>The common covariance can then be calculated as</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>M = np.empty_like(X)
for y in cls:
    M[Y==y,:] = mu[y]
S = np.cov((X - M).T)/X.shape[0]
</code></pre></div></div>
<p>The classifier parameters can then be computed as</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>S_inv = np.linalg.pinv(S)  # pseudoinverse of the common covariance
for y in cls:
    w[y] = S_inv.dot(mu[y])
    b[y] = np.log(p[y]) - 0.5*mu[y].T.dot(S_inv).dot(mu[y])
</code></pre></div></div>
<p>Predicting the class of new data is simply</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>W = np.zeros((X.shape[1],len(w)))
B = np.zeros((len(b),))
for y in w:
    W[:,y] = w[y]
    B[y] = b[y]
pred = np.argmax(X.dot(W)+B,axis=1)
</code></pre></div></div>
<p>That is it! A complete classifier in 20 lines of code. See <strong>Appendix
2</strong> for full code.</p>
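<p>A quick sanity check of the recipe on synthetic data (two well-separated Gaussian blobs of my own making, sharing an identity covariance) should give near-perfect accuracy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
cls = [0, 1]
means = {0: np.array([0.0, 0.0]), 1: np.array([4.0, 4.0])}

# 200 samples per class, shared identity covariance
X = np.vstack([rng.normal(size=(200, 2)) + means[c] for c in cls])
Y = np.repeat(cls, 200)

# classwise means and priors
mu, p, w, b = {}, {}, {}, {}
for c in cls:
    mu[c] = np.mean(X[Y == c], axis=0)
    p[c] = (Y == c).sum() / X.shape[0]

# common covariance (same normalisation as the post) and its pseudoinverse
M = np.empty_like(X)
for c in cls:
    M[Y == c, :] = mu[c]
S = np.cov((X - M).T) / X.shape[0]
S_inv = np.linalg.pinv(S)

# classifier parameters
for c in cls:
    w[c] = S_inv.dot(mu[c])
    b[c] = np.log(p[c]) - 0.5 * mu[c].T.dot(S_inv).dot(mu[c])

W = np.column_stack([w[c] for c in cls])
B = np.array([b[c] for c in cls])
pred = np.argmax(X.dot(W) + B, axis=1)
acc = np.mean(pred == Y)   # essentially 1.0 for such separated classes
```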
<h1 id="howgoodarethey">How good are they</h1>
<p>Here are the accuracies I got for different datasets on using our
classifier. All Accuracies are computed for test sets of the
corresponding datasets which are not used in computing the parameters.</p>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>* 83.90% for MNIST
* 76.51% for FashionMNIST
* 37.85% for cifar 10
* 16.67% for cifar 100
</code></pre></div></div>
<p>The classifier does well for MNIST and Fashion MNIST, but not so well
for either cifar. None of these accuracies are anywhere close to the
state of the art, which is in the high 90s for both MNISTs and cifar 10
and the high 70s for cifar 100
(<a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">link</a>).
Regardless, these are good baselines considering how little computation
and effort is required to build them.</p>
<h1 id="appendix0">Appendix 0</h1>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\log \frac{f_{+1}}{f_{-1}} &=
\log
\frac{|\Sigma_{-1}|^{1/2}}{|\Sigma_{+1}|^{1/2}}
\\&-
\frac{1}{2}
\left[
(x-\mu_{+1})^T\Sigma_{+1}^{-1}(x-\mu_{+1})
-
(x-\mu_{-1})^T\Sigma_{-1}^{-1}(x-\mu_{-1})
\right] \end{aligned} %]]></script>
<h1 id="appendix1">Appendix 1</h1>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
h(x) &= \operatorname{arg max} \left( p_yf_y(x) \right)
\\
&= \operatorname{arg max} \left( \log p_y + \log f_y(x) \right)
\\
&= \operatorname{arg max} \left( \log p_y - \frac{1}{2} \log |\Sigma| - \frac{x^T\Sigma^{-1}x}{2} - \frac{\mu_y^T\Sigma^{-1}\mu_y}{2} + \frac{2(\mu_y^T\Sigma^{-1})x}{2} \right)
\\
&= \operatorname{arg max} \left( \log p_y - \frac{\mu_y^T\Sigma^{-1}\mu_y}{2} + (\mu_y^T\Sigma^{-1})x \right)\end{aligned} %]]></script>
<h1 id="appendix2">Appendix 2</h1>
<div class="languageplaintext highlighterrouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy import linalg

def get_linear_classifier(X,Y,cls):
    mu = {}
    p = {}
    w = {}
    b = {}
    # classwise mean and probabilities
    for c in cls:
        mu[c] = np.mean(X[Y==c],axis=0)
        p[c] = (Y==c).sum()/X.shape[0]
    # common covariance matrix and its inverse
    M = np.empty_like(X)
    for c in cls:
        M[Y==c,:] = mu[c]
    S = np.cov((X - M).T)/X.shape[0]
    S_inv = linalg.pinv(S)
    # classifier parameters
    for c in cls:
        w[c] = S_inv.dot(mu[c])
        b[c] = np.log(p[c]) - 0.5*mu[c].T.dot(S_inv).dot(mu[c])
    return w,b

def test_model(w,b,X,Y):
    W = np.zeros((X.shape[1],len(w)))
    B = np.zeros((len(b),))
    for c in w:
        W[:,c] = w[c]
        B[c] = b[c]
    pred = np.argmax(X.dot(W)+B,axis=1)
    acc = sum(pred==Y)/Y.shape[0]
    return acc
</code></pre></div></div>

<hr />
<p><strong>Bayes Error</strong> · 2018-02-11 · <a href="http://www.outerproduct.space/2018/02/11/bayes_error">http://www.outerproduct.space/2018/02/11/bayes_error</a></p>

<p>In an ideal world, everything has a reason. Every question has an
unambiguous answer. The data is sufficient to explain its behaviours,
like the class it belongs to.</p>
<script type="math/tex; mode=display">\begin{aligned}
g(x) = y \end{aligned}</script>
<p>In the non-ideal world, however, there is always something missing that
stops us from knowing the entire truth; $g$ is beyond reach. In such
cases we resort to probability.</p>
<script type="math/tex; mode=display">\begin{aligned}
n(x) = P(y=+1\mid x)\end{aligned}</script>
<p>It simply tells us how probable it is that the data belongs to a class
($y=+1$) given the observations $x$.</p>
<p><em>If we build a classifier on this data, how good will it be?</em> This is
the question Bayes error answers.</p>
<h1 id="bayeserror">Bayes Error</h1>
<p>Let’s say I’ve built a classifier $h$ to predict the class of the data;
$h(x)=\hat{y}$ is the predicted class and $y$ is the true class. Even
ambiguous data needs to come from somewhere, so we assume $D$ is the
joint distribution of $x$ and $y$.</p>
<script type="math/tex; mode=display">\begin{aligned}
er_D[h] = P_D[h(x) \neq y]\end{aligned}</script>
<p>Using an old trick to convert probability to expectation, $P[A] = E[1(A)]$, we have</p>
<script type="math/tex; mode=display">\begin{aligned}
er_D[h] = E_{x,y}[1(h(x)\neq y)] = E_x E_{y\mid x}[1(h(x)\neq y)]\end{aligned}</script>
<p>The inner expectation is easier to solve when expanded.</p>
<script type="math/tex; mode=display">\begin{aligned}
E_{y\mid x}[1(h(x)\neq y)] = 1(h(x)\neq +1) P(y=+1\mid x) + 1(h(x)\neq -1)P(y=-1\mid x)\end{aligned}</script>
<p>Which give the final error to be</p>
<script type="math/tex; mode=display">\begin{aligned}
er_D[h] = E_x[1(h(x)\neq +1)\, n(x) + 1(h(x)\neq -1)(1-n(x))]\end{aligned}</script>
<p>The last equation means: if the classifier predicts $-1$ for the data,
it contributes $n(x)$ to the error; on the other hand, if it predicts
$+1$, the contribution is $1-n(x)$.</p>
<p>The best classifier would predict $+1$ when $n(x)$ is large and $-1$
when $n(x)$ is small. The minimum achievable error is then</p>
<script type="math/tex; mode=display">\begin{aligned}
er_D = E_x [\min(n(x),1-n(x))]\end{aligned}</script>
<p>This error is called <strong>Bayes Error</strong>.</p>
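<p>A tiny discrete example (my own construction) makes this tangible: with $x$ taking three values and $n(x)$ known, we can compute the Bayes error directly and check by brute force that no deterministic classifier beats it:</p>

```python
from itertools import product

# discrete x with known marginal P(x) and posterior n(x) = P(y=+1|x)
Px = {0: 0.5, 1: 0.3, 2: 0.2}
n = {0: 0.9, 1: 0.4, 2: 0.5}

# Bayes error: E_x[min(n(x), 1 - n(x))]
bayes_err = sum(Px[x] * min(n[x], 1 - n[x]) for x in Px)

def error(h):
    # predicting +1 at x contributes 1 - n(x); predicting -1 contributes n(x)
    return sum(Px[x] * ((1 - n[x]) if h[x] == +1 else n[x]) for x in Px)

# brute force over all 2^3 deterministic classifiers: none does better
best = min(error(dict(zip(Px, hs))) for hs in product([+1, -1], repeat=3))
# best == bayes_err == 0.27
```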
<h1 id="references">References</h1>
<p><a href="http://drona.csa.iisc.ernet.in/~e0270/Jan2015/">Shivani Agarwal’s
lectures</a></p>