Using Information Theory — Entropy Rate

Here's a simple example of why entropy doesn't make sense as a measure of randomness for a general stochastic process (i.e., a stochastic process with memory, like a Markov chain).

Consider the sequence

01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101

This could be generated by a Markov chain with two states, 0 and 1, where the probability of self-transition is zero in each state. If we tried to estimate the entropy of this sequence symbol by symbol, we would see that we have half 0s and half 1s, so our estimate of the entropy would be 1 bit. But this doesn't make sense: the sequence isn't random, it's a period-2 orbit. We have to be slightly more sophisticated. Denote the stochastic process by \(X_{1}, X_{2}, \ldots\), with joint distributions over finite blocks of symbols given by \(p(x_{1}, \ldots, x_{n})\). We'll compute the average block entropy for blocks of length \(n\) as

\(\bar{H}(n) = \frac{1}{n} H[X_{1}, \ldots, X_{n}] = -\frac{1}{n} E\left[\log_{2} p(X_{1}, \ldots, X_{n})\right].\)
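To make this concrete, here's a minimal sketch (in Python, not part of the original argument) that computes plug-in estimates of \(\bar{H}(n)\) from overlapping length-\(n\) blocks of the sequence above; the overlapping-block estimator and the name block_entropy_rate are assumptions for illustration.

```python
# Minimal sketch: plug-in estimates of the average block entropy Hbar(n)
# for the period-2 sequence above, using overlapping length-n blocks.
from collections import Counter
from math import log2

sequence = "01" * 52  # a finite sample of the 0101... sequence

def block_entropy_rate(seq, n):
    """Estimate Hbar(n) = H[X_1, ..., X_n] / n from overlapping n-blocks."""
    blocks = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    H = -sum((c / total) * log2(c / total) for c in counts.values())
    return H / n

for n in (1, 2, 4, 8):
    print(n, block_entropy_rate(sequence, n))
# n = 1 gives 1 bit/symbol (half 0s, half 1s), but for n >= 2 only the two
# blocks "0101..." and "1010..." ever occur, so the estimate falls off like 1/n.
```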

For a (strong-sense) stationary stochastic process, \(\left\{\bar{H}(n)\right\}_{n = 1}^{\infty}\) is a monotonically non-increasing sequence bounded below by 0, and thus must converge. The limiting value of this sequence is what we call the entropy rate. If we compute this out for the Markov chain example (either analytically, or by estimating the distribution of 2-words, 3-words, etc.), we'll see that the sequence of averaged block entropies converges to zero: assuming the chain starts from its stationary (uniform) distribution, the first symbol carries one bit of uncertainty and every later symbol is determined by its predecessor, so \(H[X_{1}, \ldots, X_{n}] = 1\) bit and \(\bar{H}(n) = 1/n \to 0\). Thus, the Markov chain has zero entropy rate. The intuition here is captured by an alternative definition of entropy rate,

\(\bar{H} = \lim\limits_{n \to \infty} H[X_{n} | X_{n - 1}, \ldots, X_{1}]\),

which we can read as the average surprise per symbol given everything that came before. With this system, once we've seen the previous symbol the next one is completely determined, so there is no surprise at all.
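To connect this back to the block-entropy definition numerically, here's a minimal sketch (again in Python, our own illustration) that approximates \(H[X_{n} \mid X_{n - 1}, \ldots, X_{1}]\) by the chain-rule difference of successive plug-in block entropies, \(H[X_{1}, \ldots, X_{n}] - H[X_{1}, \ldots, X_{n - 1}]\); the name block_entropy and the overlapping-block estimator are again assumptions for illustration.

```python
# Minimal sketch: estimate the conditional entropy H[X_n | X_{n-1}, ..., X_1]
# as the difference of consecutive plug-in block entropies.
from collections import Counter
from math import log2

sequence = "01" * 52  # a finite sample of the period-2 sequence

def block_entropy(seq, n):
    """Plug-in estimate of H[X_1, ..., X_n] from overlapping n-blocks."""
    blocks = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * log2(c / total) for c in counts.values())

for n in (1, 2, 3, 4):
    previous = block_entropy(sequence, n - 1) if n > 1 else 0.0
    print(n, block_entropy(sequence, n) - previous)
# n = 1: about 1 bit of surprise (no history yet). For n >= 2 the estimate is
# essentially zero (the tiny nonzero values are finite-sample artifacts):
# once we've seen the previous symbol, the next one carries no surprise.
```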