Why Entropy?
There are many ways to define the entropy of a random variable. Claude Shannon originally justified his choice on axiomatic grounds. If we force our functional \(H[p(x)]\) to satisfy three properties, then the form of entropy as
\[H[p(x)] = - E[\log_{2} p(X)]\]
falls out of the requirements on \(H[p(x)]\).
A less formal, but equivalent, definition of entropy is in terms of the minimum expected number of binary questions needed to identify the value of a discrete random variable \(X\). This is (from personal experience) a typical starting point for introducing entropy. Let's take this approach, but then follow it to completion1.
We'll start with the example from Cover and Thomas. Let \(X\) be a discrete random variable taking values from the 'alphabet'2 \(\mathcal{X} = \{ a, b, c, d\}\). Thus, we can specify \(X\) by giving its probability mass function,
\[p(x) = P(X = x) = \left\{ \begin{array}{cc} \frac{1}{2} &: x = a \\ \frac{1}{4} &: x = b \\ \frac{1}{8} &: x = c \\ \frac{1}{8} &: x = d \end{array}\right. .\]
Our goal is to ask a series of binary questions of the form 'Is \(X = x\)?' and get the answer 'Yes!' as quickly as possible. Of course, \(X\) is a random variable, so the number of questions we will have to ask isn't a number, but a random variable itself. Thus, we can formalize our goal as follows: we want to minimize the expected number of questions we'll have to ask to get at the identity of \(X\). Call this quantity \(\bar{\ell}\) (the reason for this choice will become clear later on). We can work out this value for the probability mass function above. But first, we have to specify the procedure that we'll use to ask the questions. The obvious choice, and the one we'll make, is to first ask 'Is \(X = a\)?' (since \(a\) is the most probable value), then ask about \(b\), then \(c\), and so on, until we've arrived at the correct value of \(X\). Given this strategy, we'll ask a single question \(1/2\) of the time (since \(P(X = a) = 1/2\)), two questions \(1/4\) of the time, and three questions \(1/8 + 1/8 = 1/4\) of the time3. Thus, the expected number of questions we'll have to ask is
\[ \bar{\ell} = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75.\]
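As a quick sanity check, here is a short Python sketch of this arithmetic. The pmf values and the questioning order come straight from the example; the variable names are just my own choices.

```python
# Sanity check of the expected number of questions for the example pmf.
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Ask about the symbols in order of decreasing probability: a, b, c, then d.
# Identifying the i-th symbol takes i questions, except that the last symbol
# is pinned down by the final 'No' answer (see footnote 3).
order = sorted(pmf, key=pmf.get, reverse=True)
questions = {x: min(i + 1, len(order) - 1) for i, x in enumerate(order)}

expected_questions = sum(pmf[x] * questions[x] for x in pmf)
print(expected_questions)  # 1.75
```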
A quick computation shows that 1.75 is also the entropy of \(X\), since4
\[ \begin{align} H[X] &= -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ &= -\left(\frac{1}{2} \cdot \log \frac{1}{2} + \frac{1}{4} \cdot \log \frac{1}{4} + \frac{1}{8} \cdot \log \frac{1}{8} + \frac{1}{8} \cdot \log \frac{1}{8}\right) \\ &= 1.75. \end{align}\]
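The same check for the entropy, again as a minimal Python sketch:

```python
from math import log2

# Entropy (in bits) of the same pmf.
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(p * log2(p) for p in pmf.values())
print(H)  # 1.75
```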
Thus, for this case, the entropy and the minimum expected number of questions turn out to coincide. This only occurs when \(p(x)\) is a dyadic distribution5. In the general case where \(p(x)\) is not dyadic, the minimum expected number of questions is bounded between entropy and entropy plus 1. That is, in general,
\[ H[X] \leq \bar{\ell}^{*} < H[X] + 1.\]
Why this should be the case will (hopefully) become more clear in a moment.
Okay, so far we haven't done anything beyond the standard approach taken in Cover and Thomas. But now let's formalize our question-asking process a little, and see how we can relate it to data compression. The goal of lossless data compression is to map an alphabet \(\mathcal{X}\) into a new alphabet \(\mathcal{D}\) (which we'll take to be binary, namely \(\mathcal{D} = \{ 0, 1\}\)) so as to make the encoded words as short as possible, in expectation. There are a lot of results from data compression we could draw on; the main one we'll need is the optimality of Huffman codes. I won't go into the details of what a Huffman code is, but suffice it to say that this code will get us the questioning procedure that minimizes the expected number of binary questions.
Here is how we will use the Huffman code to encode our alphabet \(\mathcal{X} = \{a, b, c, d\}\). We'll map each element of \(\mathcal{X}\) into a binary string which keeps track of the number of questions we asked as well as the answers to those questions. We'll use the same questioning procedure as before, which will give us the mapping
\[\begin{align} a &\to 1 \\ b &\to 01 \\ c &\to 001 \\ d &\to 000. \end{align}\]
We'll read each binary string from left to right. Thus, the string that \(a\) maps to, namely 1, says, "We asked 'Is \(X = a\)?' and got the answer 'Yes.'" The string that \(b\) maps to says, "We asked 'Is \(X = a\)?' and got the answer 'No.' We then asked 'Is \(X = b\)?' and got the answer 'Yes.'" To go one step further, the string that \(d\) maps to says, "We asked 'Is \(X = a\)?' and got the answer 'No.' We then asked 'Is \(X = b\)?' and got the answer 'No.' We then asked 'Is \(X = c\)?' and got the answer 'No.'" (Thus, \(X = d\), by exhaustion of the possibilities.) Note that this code is not unique (for example, we could have asked about \(d\) before asking about \(c\), or interchanged the roles of 0 and 1), but it is optimal.
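For the curious, here is a small Python sketch of the standard textbook Huffman construction (this isn't code from the post; the function name and tie-breaking are my own choices). Ties and the 0/1 labeling may come out differently from the mapping above, but the codeword lengths, and hence the expected length, are the same.

```python
import heapq
from itertools import count

def huffman_code(pmf):
    """Build a binary Huffman code for a pmf given as {symbol: probability}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {x: ""}) for x, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees, prepending a bit
        # to every codeword in each.
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {x: "0" + c for x, c in code0.items()}
        merged.update({x: "1" + c for x, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(pmf)
print(code)  # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}: lengths 1, 2, 3, 3
print(sum(pmf[x] * len(c) for x, c in code.items()))  # 1.75
```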
So now we have a way to formulate our question procedure in an optimal way, and we see that the length of each codeword corresponds to the number of questions we had to ask to get at \(X\)'s identity. Therefore, the expected length of a codeword is equal to the expected number of binary questions, and we have forged the missing link between data compression and binary questions.
We still haven't shown that the optimal questioning strategy results in an expected number of binary questions between entropy and entropy plus 1. To get this result, we can now turn again to the ample literature on coding theory, and realize that the Huffman code, which we used to construct our procedure, is a prefix code. One of the (many) nice things about prefix codes is that they satisfy Kraft's inequality. Namely, if for each \(x_{i} \in \mathcal{X}\) we specify a code length \(\ell_{i}\), which is the number of digits that \(x_{i}\) maps to6, and the code is a prefix code, then Kraft's inequality requires that
\[ \sum_{i = 1}^{|\mathcal{X}|} D^{-\ell_{i}} \leq 1,\]
where we have assumed a \(D\)-ary encoding. (For a binary encoding, we took \(D = 2\).) This restriction, by itself, allows us to demonstrate that the expected code length (which, remember, is in a one-to-one correspondence with the expected number of binary questions) must be at least as large as the entropy7. To show that the expected code length of the optimal code is bounded above by entropy plus 1, we construct a simpler, but sub-optimal, code called the Shannon-Fano code, whose expected length is easily shown to be bounded above by entropy plus 1. Since our Huffman code is optimal, and thus at least as good as the Shannon-Fano code, we conclude that overall
\[ H[X] \leq \bar{\ell}^{*} < H[X] + 1\]
where \(\bar{\ell}^{*}\) is the expected code length (expected number of binary questions) for the optimal code (optimal question procedure).
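To see these bounds in action on a distribution that isn't dyadic, here's a short Python check that reuses the huffman_code sketch from above. The example probabilities are ones I've picked purely for illustration, and the Shannon-Fano lengths are the usual \(\lceil \log 1/p(x) \rceil\).

```python
from math import ceil, log2

pmf = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}  # not dyadic

code = huffman_code(pmf)  # defined in the earlier sketch
lengths = {x: len(c) for x, c in code.items()}

kraft_sum = sum(2 ** -l for l in lengths.values())
H = -sum(p * log2(p) for p in pmf.values())
l_star = sum(pmf[x] * lengths[x] for x in pmf)                 # optimal (Huffman) expected length
shannon_fano = sum(pmf[x] * ceil(-log2(pmf[x])) for x in pmf)  # sub-optimal Shannon-Fano expected length

print(kraft_sum <= 1)                       # True: the Huffman code is a prefix code
print(H <= l_star <= shannon_fano < H + 1)  # True: H <= optimal <= Shannon-Fano < H + 1
print(H, l_star)                            # roughly 1.846 and 1.9
```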
Following the idea to its natural, and final, conclusion is not something that seems typical. In Cover and Thomas's Elements of Information Theory, for example, the definition of entropy in terms of the minimum expected number of binary questions is given, but then not justified beyond a single example. They promise a justification in Chapter 5 of the text, which covers Data Compression. But as far as I can tell (and I haven't made a careful reading of that chapter), they do not then make the key connection between data compression and entropy that they allude to in Chapter 1. Admittedly, the connection is obvious (if you think about it long enough), but I still think it's worth stating explicitly. A quick Google search found no mentions of this explicit connection, so maybe this post can fill that void.↩
Information theory is overflowing with these electrical engineering-specific words: alphabet, code word, etc. As someone coming to the topic from a probabilistic / statistical background, this can be off-putting at times. But I have to remind myself that the electrical engineers made the first go at information theory, so they have earned the right to name things as they please.↩
Note that we only ever have to ask at most 3 questions. With \(n\) possible values, we'll always have to ask at most \(n - 1\) questions, since on the last question, we know the identity of \(X\) even if the answer is no (we don't need an \(n^\text{th}\) question to identify \(X\)).↩
From here on out, all logarithms are taken base 2.↩
Or to be more exact, when the distribution is \(D\)-adic, where we take our encoded alphabet to be \(\mathcal{D} = \{ 0, 1, \ldots, D- 1\}\). The dyadic case is a special case where \(D = 2.\)↩
In the previous example, the length of the codeword \(a\) maps to is \(1\), so \(\ell_{a} = 1\). Similarly, \(\ell_{b} = 2, \ell_{c} = 3\), and \(\ell_{d} = 3\).↩
The proof is slick, and only involves a straightforward normalization and a simple tool from Information Theory. Of course, as with all slick proofs, it requires a trick that only seems obvious after the fact.↩