You and your friend are each training a neural network for classification. Both of you are using identical training data. The data has four classes: 40% of the examples are cat images, 10% are dog images, and horse and sheep images make up 25% each. Since the project deadline is nearing, both of you decide to run only a few epochs and get on with the report writing. At the same time, the two of you have a friendly wager, with $10 going to whoever trains the better model.

At the end of training, you find that your model, Net1, makes 30% recognition errors, and the labels it assigns to the training data are split 25% each across the four classes. As luck would have it, your friend's model, Net2, also yields a 30% error rate, but its assigned labels on the training set are distributed differently: 40% cats, 10% dogs, 10% horses, and 40% sheep.

Since both models have the same error rate, your friend declares a tie. You, on the other hand, insist that your model Net1 is slightly better and want your friend to pay.

Who is right? Let's figure it out using the concept of cross entropy.

### Cross Entropy

Before defining cross entropy, let's first look at entropy. Entropy is a measure of uncertainty. Think of a training set of images of cats and dogs for an image recognition problem. Assume the training set has an equal number of cat and dog images. If you are asked to blindly pull an image of a cat from the training set, you know that you have a 50% chance of success. Now let's change the mix so that cat images make up 70% of the set. If you are again asked to blindly pull an image of a cat, your chances of success are much better this time. You are more certain of success than you were with the previous mix. Entropy captures this notion of uncertainty and is defined through the following formula:

$E(C) = -\sum_i p(c_i )\log p(c_i)$,

where $C$ stands for a random variable representing the class label of a training example, and $p(c_i)$ stands for the probability of the label being $c_i$.
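The cat and dog example can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (so entropy is measured in bits); the function name `entropy` and the variable names are illustrative choices, not part of the original discussion.

```python
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Classes with zero probability contribute nothing to the sum, so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

balanced = entropy([0.5, 0.5])  # equal mix of cat and dog images
skewed = entropy([0.7, 0.3])    # 70% cat images, 30% dog images

print(f"balanced mix: {balanced:.3f} bits")  # 1.000
print(f"skewed mix:   {skewed:.3f} bits")    # 0.881
```

The 70/30 mix has lower entropy than the 50/50 mix, matching the intuition that you are more certain about what you will pull out.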

While entropy captures the uncertainty of a single distribution (the label distribution of a training set in our case), we are often interested in comparing a pair of distributions (those of a pair of training sets, for example). Letting $C$ and $\hat{C}$ denote the respective random variables representing class labels, the cross entropy is given by the following formula:

$H(C,\hat{C}) = -\sum_i p(c_i)\log p(\hat{c}_i)$

It can be seen that this is not a symmetric function: swapping $C$ and $\hat{C}$ generally gives a different value. Cross entropy is a popular loss for training multi-class classifiers because the loss surface it produces is considered better suited to gradient-based search.
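The lack of symmetry is easy to see on a small example. This is a minimal sketch, assuming base-2 logarithms; `cross_entropy`, `p`, and `q` are illustrative names, and the two distributions are the 70/30 and 50/50 cat and dog mixes used earlier.

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits: -sum_i p_i * log2(q_i)."""
    # Terms with p_i == 0 are skipped. A q_i of 0 paired with p_i > 0
    # would make the cross entropy infinite; math.log2 raises ValueError there.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.3]  # 70% cats, 30% dogs
q = [0.5, 0.5]  # equal mix

h_pq = cross_entropy(p, q)
h_qp = cross_entropy(q, p)

print(f"H(p, q) = {h_pq:.3f}")  # 1.000
print(f"H(q, p) = {h_qp:.3f}")  # 1.126
```

Swapping the arguments changes the result, so the order in which the two distributions are passed matters.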

Now back to the wager. We are going to compute the cross entropy of the class label distributions produced by Net1 and Net2 with respect to the distribution in the training set, and see how close each model comes to the label distribution in the training set. Plugging in the percentages of the labels produced by Net1, we get

$H(C, \text{Net1}) = -(0.4\log(0.25) + 0.1\log(0.25) + 0.25\log(0.25) + 0.25\log(0.25)) = 2.0$

With Net2, the cross entropy is

$H(C, \text{Net2}) = -(0.4\log(0.4) + 0.1\log(0.1) + 0.25\log(0.1) + 0.25\log(0.4)) \approx 2.02$

To see whether Net1 or Net2 produced a label distribution closer to the original distribution, let's look at the entropy of the training set. It is

$E(C) = -(0.4\log(0.4) + 0.1\log(0.1) + 0.25\log(0.25) + 0.25\log(0.25)) \approx 1.86$

All of the above calculations use base-2 logarithms. We see that Net1's cross entropy value (2.0) is slightly closer to the entropy of the training set than Net2's (2.02). This means the result given by Net1 is comparatively more similar to the distribution of labels in the training set. Thus, your friend must pay you $10.
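The whole comparison can be reproduced in a few lines. This is a minimal sketch, assuming base-2 logarithms as stated above; the helper `cross_entropy` and the variable names are illustrative, and the class order (cat, dog, horse, sheep) is simply the order used in the story.

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits: -sum_i p_i * log2(q_i)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Label distributions in the order: cat, dog, horse, sheep
train = [0.40, 0.10, 0.25, 0.25]  # true labels in the training set
net1 = [0.25, 0.25, 0.25, 0.25]   # labels assigned by Net1
net2 = [0.40, 0.10, 0.10, 0.40]   # labels assigned by Net2

# The cross entropy of a distribution with itself is its entropy.
h_train = cross_entropy(train, train)
h_net1 = cross_entropy(train, net1)
h_net2 = cross_entropy(train, net2)

print(f"E(C)       = {h_train:.3f}")  # 1.861
print(f"H(C, Net1) = {h_net1:.3f}")   # 2.000
print(f"H(C, Net2) = {h_net2:.3f}")   # 2.022

# The smaller the gap to the training-set entropy, the closer the model's
# label distribution is to the true one.
gap_net1 = h_net1 - h_train
gap_net2 = h_net2 - h_train
print("Closer model:", "Net1" if gap_net1 < gap_net2 else "Net2")
```

Under these assumptions the gap for Net1 is about 0.14 bits and for Net2 about 0.16 bits, so Net1 comes out slightly ahead.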