Integrated Knowledge Solutions: Entropy

You and your friend are training a neural network for classification. Both of you are using identical training data. The data has four classes with 40% examples of cat images, 10% images of dogs, and 25% each of horse and sheep images. Since the deadline for the project is nearing, both of you decide to run only a few epochs and get to report writing. At the same time, the two of you have a friendly wager of $10 going to the winner of the better model.

At the end of training, you find out that your model, Net1, is making 30% recognition errors and the resulting distribution of assigned labels to the training data is 25% each for four classes. As luck would have it, your friend's model, Net2, is also yielding 30% error rate but the assigned labels in the training set are different with 40% cats, 10% dogs, 10% horse, and 40% sheep.

Since the error rate by both models is identical, your friend declares a tie. You on the other hand are insisting that your model Net1 is slightly better and want the friend to pay.

Who is right? Let's figure it out using the concept of cross entropy.

Cross Entropy

Before defining cross entropy, let's first look at entropy. Entropy is a measure of uncertainty. Think of a training set of images of cats and dogs for an image recognition problem. Assume the training set is having equal number of cat and dog images. If you are asked to blindly pull an image of a cat from the training set, you know that you have a 50% chance of success. Let's change the mix of cat and dog images to increase the percentage of cat images to 70%. Now if you are asked to blindly pull an image of a cat, you know your chances of success are much better this time. You are more certain of success compared to previous mix. Entropy captures this notion of uncertainty and is defined through the following formula:

$E(C) = -\sum_i p(c_i )\log p(c_i)$,

where C stand for a random variable representing class label of a training example. Let $p(c_i)$ stand for the probability of label being $c_i$.

While entropy captures the uncertainty of a single event (a training set in our case), we are often interested in comparing a pair of events (a pair of training sets for example). Letting $C$ and $\hat{C}$ denote the respective random variables representing class labels, the cross entropy is measured by the following formula:

$H(C,\hat{C}) = -\sum_i p(c_i)\log p(\hat{c}_i)$

It can be seen that it is not a symmetric function. Cross entropy measure is popular for training multi-class classification problems because the loss function surface with this criterion is considered better for gradient search.

Now back to the wager. We are going to compute the cross entropy of the class label's distributions produced by Net1 and Net2 with respect the distribution in the training set and see how close models Net1 and Net2 are to the label distribution in the training set. Plugging in the Percentages of the labels produced by Net1 and Net2, we get

$H(C, Net1) = -(0.4*log(0.25) + 0.1*log(0.25)$

+ 0.25*log(0.25) + 0.25*log(0.25) => 2.0

With Net2, the cross entropy is

$H(C, Net1) = -(0.4*log(0.4) + 0.1*log(0.1)$

+ 0.25*log(0.25) + 0.25*log(0.25) => 2.02

To see whether Net1 or Net2 produced labels distribution more closer to the original distribution, let's look at the entropy of the training set. It is

$ E(C) = -(0.4*log(0.4) + 0.1*log(0.1)$

+ 0.25*log(0.25) + 0.25*log(0.25) => 1.86

All above calculations are done using base2. We see that Net1 cross entropy value is slightly closer to the entropy of the training set. This means the result given by Net1 is comparatively more similar to the distribution of labels in the training set. Thus, your friend must pay you $10.

Large language models (LLMs) continue to fascinate us with their capabilities to answer our questions, generate presentations and essays for us and many other assorted tasks. These models are also good at generating code for user specified tasks. However, almost all of them do not run the code for us; they simply give us the code that we can copy and execute. Recently, Google has given its large language model, Bard, the computational capabilities as well. Bard thus not only provides the code but also executes it while answering user's questions.

I wanted to check this feature of Bard. Below is what happened when I asked Bard a question that involved some computation.

Not only generating the code for entropy calculation and running it, Bard went on to explain entropy and its answer.

Google characterizes computing by Bard in response to user questions as "writing code on the fly" method. The company says, "So far, we've seen this method improve the accuracy of Bard’s responses to computation-based word and math problems in our internal challenge datasets by approximately 30%."

The "writing code on the fly by Bard" is not limited to word or math problems; you can ask Bard to do coding and execution for larger problems too. As an example, I asked Bard to generate and execute code for building a regression model with my data being specified via a URL. I was pleasantly surprised to see Bard execute the code, get the results including the plots. You can read about my this interaction with Bard at this post.

Integrated Knowledge Solutions

Pages

Whose Model is Better?

Cross Entropy

Google's Bard Can Code and Compute for You

Search This Blog