How many ways can you estimate entropy?

I've written a good deal about entropy. But in all that time, I haven't covered the easiest (though still hard) problem: how to estimate discrete entropy. (See older posts for how to estimate differential entropy.)

As a refresher, for a discrete random variable \(X\) with support \(\mathcal{X}\) and probability mass function \(p(x) = P(X = x)\), we define the Shannon entropy as \[ H[X] = -E[\log_{2} p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_{2} p(x).\] This is a 'population-level' value. You can think of it as a parameter of the distribution, something akin to the number of yes-no questions needed to identify the value of \(X\) upon each new realization you come across.

Of course, in the real world, we often observe a sample from \(P_{X}\), namely \(X_{1}, \ldots, X_{n}\) and want to say something about the distribution that generated that sample. That is, we want to do statistical inference: to make guesses at this uncertainty parameter.

How many different ways can you estimate the entropy of a distribution from a sample? It turns out: a lot[1]. And researchers are still inventing new ones.

The simplest way to estimate the entropy is to use a naive / plug-in / non-parametric / maximum likelihood estimate[2]. All of these names refer to the same procedure, the one you (or at least I) would most naturally reach for: estimate each probability \(p(x)\) as the proportion of your sample that took the value \(x\), call these \(\{\hat{p}(x)\}_{x \in \hat{\mathcal{X}}}\), and compute

\[ \hat{H} = -\sum_{x \in \hat{\mathcal{X}}} \hat{p}(x) \log_{2} \hat{p}(x).\]
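The plug-in estimator is only a few lines of code. Here's a minimal sketch in Python (the function name `plugin_entropy` is my own, not from any particular library):

```python
from collections import Counter
from math import log2

def plugin_entropy(sample):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    n = len(sample)
    # Estimate each p(x) by its empirical proportion, then plug in.
    return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

# Three 'a's and one 'b': p_hat = (3/4, 1/4), so H_hat ≈ 0.811 bits.
print(plugin_entropy(["a", "b", "a", "a"]))
```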

This seems like a reasonable enough way to go about the problem. But it turns out the estimator is biased for any finite \(n\), though asymptotically unbiased in the limit as \(n \to \infty\). We can show this relatively easily by applying Jensen's inequality, which tells us what happens when we exchange an expectation with a function evaluation. In particular, for a concave function \(f\), we have that \[ E[f(X)] \leq f(E[X]).\]

What do I mean by entropy being concave? If we consider entropy to be a function of \(\mathbf{p} = \{ p(x) \}_{x \in \mathcal{X}}\) (that is, think of the probabilities as a vector), then we can think of \(H[X]\) as a (multi-dimensional) function \(H(\mathbf{p})\) taking this vector as an input. Entropy turns out to be concave in \(\mathbf{p}\) (this can be proven), and since the empirical proportions are unbiased, \(E[\hat{\mathbf{p}}] = \mathbf{p}\), we have that \[ E[H(\hat{\mathbf{p}})] \leq H(E[\hat{\mathbf{p}}]) = H(\mathbf{p}) = H[X].\] We have then that the bias is \[ \text{Bias}(\hat{H}) = E[\hat{H}] - H \leq 0,\] so our naive estimator will always be negatively[3] biased.
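We can see this negative bias empirically. The sketch below assumes a fair coin (so the true entropy is exactly 1 bit) and small samples of size 10, then averages the plug-in estimate over many simulated samples:

```python
import random
from collections import Counter
from math import log2

def plugin_entropy(sample):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    n = len(sample)
    return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

random.seed(0)
n, trials = 10, 5000
# True entropy of a fair coin is exactly 1 bit; for small n, the
# plug-in estimates should average noticeably below that.
mean_est = sum(
    plugin_entropy([random.randint(0, 1) for _ in range(n)])
    for _ in range(trials)
) / trials
print(mean_est)  # below 1 bit, as Jensen's inequality predicts
```

The average lands around 0.93 bits rather than 1, a concrete instance of the negative bias.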

The fact that our estimate of entropy is biased sounds worrisome. But it shouldn't worry us too much, since in fact any estimator we could come up with for the entropy of a distribution will be biased. See page 1236 of Paninski's Estimation of Entropy and Mutual Information for a slick proof. Thus, all we can do is attempt to mitigate[4] the bias; we can never hope to remove it completely.

In a future post, I'll discuss two methods for dealing with the bias, one very old, the other much more recent.

Because I can't leave a blog post like this without showing a few plots, here's a simple application of entropy estimation. Consider a probability distribution with the probability mass function \[ P(W = w) = C \, w^{-1}, \quad w = 1, \ldots, 50,\] where \(C = \left(\sum\limits_{w = 1}^{50} w^{-1}\right)^{-1}.\) This distribution is within the Zipfian family[5], and is a popular statistical model of, for instance, the frequency of words in a document. (Hence the choice of \(W\) for the random variable.) The distribution itself looks like this:

with some of the words getting much of the probability mass, and many getting only a little (the 'long tail' of the distribution, though not so long as most, since it has finite support).
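Under the pmf above, both the normalizing constant and the true entropy take only a few lines to compute (a sketch; variable names are mine):

```python
from math import log2

W = 50
# Normalizing constant: reciprocal of the 50th harmonic number.
C = 1 / sum(1 / w for w in range(1, W + 1))
pmf = {w: C / w for w in range(1, W + 1)}

# True entropy of the distribution, in bits.
H = -sum(p * log2(p) for p in pmf.values())
print(H)  # ≈ 4.612478
```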

We can estimate the entropy of the distribution for varying sample sizes \(n\), and we know the true entropy of the distribution is approximately 4.612478:

All of our estimators converge on the truth as the sample size increases (except for the Bayesian one, but, well, it's Bayesian). I'll talk about two of these estimators ('MM', which stands for Miller-Madow, and 'Bootstrap') in a future post. Both clearly do better than the naive estimator: the Miller-Madow estimator for a simple reason, the bootstrap for a very clever one.
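For the impatient, the Miller-Madow correction is simple enough to sketch here: it adds \((\hat{K} - 1)/(2n)\) nats (divide by \(\ln 2\) for bits) to the plug-in estimate, where \(\hat{K}\) is the number of distinct values observed in the sample. A sketch applying it to samples from the Zipfian distribution above (function names are mine):

```python
import random
from collections import Counter
from math import log, log2

def plugin_entropy(sample):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    n = len(sample)
    return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

def miller_madow_entropy(sample):
    """Plug-in estimate plus the first-order bias correction, in bits."""
    n = len(sample)
    k_hat = len(set(sample))  # number of distinct values observed
    return plugin_entropy(sample) + (k_hat - 1) / (2 * n * log(2))

# Sample from the Zipfian pmf with support 1..50 (true entropy ≈ 4.612 bits).
W = 50
C = 1 / sum(1 / w for w in range(1, W + 1))
support = list(range(1, W + 1))
weights = [C / w for w in support]

random.seed(1)
sample = random.choices(support, weights=weights, k=500)
print(plugin_entropy(sample), miller_madow_entropy(sample))
```

The correction is always non-negative, so it pushes the (negatively biased) naive estimate back toward the truth.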

  1. This turns out to be one of those methodological problems that I've been 'doing wrong' for a while now. I didn't realize how subtle entropy estimation is. But now (I hope) I have a basic understanding of some of the intricacies, and can leave the follies of my youth behind.

  2. Non-parametric because we're treating the distribution as a categorical random variable, instead of specifying any particular distribution like Poisson, binomial, etc. If we did have reason to assume a parametric model, we'd be better off estimating the parameter instead of estimating each of the individual probabilities. Parametric models are nice, when they're well-specified.

  3. Mind you, this doesn't mean any given estimate will be less than the true entropy of the distribution. That would be nice. This statement only says that the expected value of the sampling distribution will always be less than the true value. Any given realization from the sampling distribution can lie on either side. As usual, we have to be careful in our thinking about statements regarding the sampling distribution.

  4. Always keeping in mind the bias-variance tradeoff. Classical statistics worried about finding minimum variance unbiased estimators (abbreviated MVUE). (See the wonderful theory wrapped up in the Cramér–Rao bound.) But we might be able to find slightly biased estimators with much better variance that give us an overall smaller mean-squared error. Most of the work I've seen on entropy estimation focuses on minimizing the bias, but many do implicitly account for the variance by computing mean-squared error. This is good, because an unbiased estimator is no good if the actual estimates are scattered far from the true value.

  5. A power law, for those who care about such things. Personally, after spending a month in Santa Fe, I'm sick and tired of hearing about power laws.