Analyzing and Bootstrapping Our Way to a Better Entropy Estimator — Part I, Analytical
First, the analytical way to get at a better entropy estimator. As we saw in the previous post, it's known that any estimator of the entropy of a discrete distribution will be biased. And it's also known that the maximum likelihood estimator, while biased, is in some way the 'best' estimator (at least for large sample sizes) in that it's the minimax estimator for entropy. So, it might make sense to see if we can patch up the maximum likelihood estimator by trying to account for its bias.
The way to go about this is, what else, a Taylor expansion. Again, Liam Paninski has already done most of the work for us. But let's try and demystify parts of his derivation, which in my opinion goes a little too far off the pure math deep end (though I'm sure he had good reasons for doing so that didn't make it into the paper proper).
We want to make a statement about the bias of the maximum likelihood estimator for entropy \(\hat{H} = H\left(\hat{\mathbf{p}}\right)\), where \(H\) is the usual entropy functional of a probability vector \(\mathbf{p} = (p_{1}, \ldots, p_{d})\), \[ H(\mathbf{p}) = -\sum_{j = 1}^{d} p_{j} \log p_{j}.\] (Logs will be taken base-\(e\), mostly for convenience. We can always convert from nats to bits at the end.) This means we need to know the expected value of \(\hat{H}\). We could try to compute this directly, but remember that the expectation is taken over the sample \((X_{1}, \ldots, X_{n})\) used to compute \(\hat{\mathbf{p}}\), and we don't want to assume any particular distribution for the \(X_{i}\) (this is why the estimator is 'non-parametric'). Instead, we'll use a (multi-dimensional) Taylor expansion of \(\hat{H}\) about the true \(\mathbf{p}\), and hope that the linear approximation holds when \(\hat{\mathbf{p}}\) is close enough to \(\mathbf{p}\). The result looks nasty1, but the intuition is simple, and we already saw how to do this for a scalar function of two variables. The linear expansion is the same as for a scalar function \(f\), \[ f(x) = f(x^{*}) + (x - x^{*}) f'(x^{*}) + r(x, x^{*}),\] with the scalar derivative \(f'\) replaced by the gradient. Thus, we expand the entropy functional2 as \[ \hat{H} = H(\mathbf{p}) + (\hat{\mathbf{p}} - \mathbf{p})^{T} (\nabla H) (\mathbf{p}) + r(\hat{\mathbf{p}}, \mathbf{p})\] where \[ ((\nabla H) (\mathbf{p}))_{j} = - (\log p_{j} + 1).\] You can carry this through (at this point, it's an exercise in dot products and basic facts about probability mass functions), and you'll find that the remainder term is exactly a (negative) Kullback-Leibler divergence: \[ \hat{H} - H(\mathbf{p}) = - \sum_{j = 1}^{d} (\hat{p}_{j} - p_{j}) \log p_{j} - D_{KL}(\hat{\mathbf{p}} || \mathbf{p}),\] where \(D_{KL}(\mathbf{q} || \mathbf{p})\) is the Kullback-Leibler divergence, \[ D_{KL}(\mathbf{q} || \mathbf{p}) = \sum_{j = 1}^{d} q_{j} \log \frac{q_{j}}{p_{j}}.\] As Paninski points out, this is another nice way to see that the maximum likelihood estimator is negatively biased: since \(E[\hat{p}_{j}] = p_{j}\), the linear term has zero expectation, so the bias is \(E[\hat{H}] - H(\mathbf{p}) = - E[D_{KL}(\hat{\mathbf{p}} || \mathbf{p})]\), and one of the properties of the Kullback-Leibler divergence is that it's always non-negative, so the bias must be non-positive.
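If you'd rather not push the dot products around by hand, a quick numerical check makes the decomposition concrete. Here's a minimal sketch (the helper names `plugin_entropy` and `kl_divergence`, and the test distribution, are my choices, not anything from Paninski's paper): it draws one multinomial sample and confirms that the plug-in error equals the linear term minus the Kullback-Leibler divergence, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(q):
    """Plug-in (maximum likelihood) entropy in nats, with the 0 log 0 := 0 convention."""
    q = q[q > 0]
    return -np.sum(q * np.log(q))

def kl_divergence(q, p):
    """D_KL(q || p) in nats, assuming p > 0 everywhere and 0 log 0 := 0."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

# A true distribution p on d = 5 symbols, and one sample of size n = 50.
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
n = 50
p_hat = rng.multinomial(n, p) / n

lhs = plugin_entropy(p_hat) - plugin_entropy(p)    # H-hat minus H(p)
linear = -np.sum((p_hat - p) * np.log(p))          # first-order Taylor term
rhs = linear - kl_divergence(p_hat, p)             # linear term plus the KL remainder
print(lhs, rhs)  # the two should agree to machine precision
```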
With a lot more work (Taylor expand the Kullback-Leibler divergence, etc.) that I will skip, you can show that the bias is given by \[ \text{Bias}(\hat{H}) = E[\hat{H}] - H = - \frac{d - 1}{2 n} + O(n^{-2}).\] Thus, the leading-order term in the bias depends on the alphabet size \(d\) and the sample size \(n\), and decreases like \(n^{-1}\) for large \(n\).
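To see the \(-\frac{d - 1}{2 n}\) term show up in practice, here's a small Monte Carlo sketch (again, the names and the uniform test distribution are mine): it averages the plug-in estimator over many samples from a uniform distribution on \(d = 10\) symbols and compares the empirical bias to \(-\frac{d - 1}{2 n}\). The two should land close together; any gap is the \(O(n^{-2})\) remainder plus Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(1)

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy in nats from a vector of symbol counts."""
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

# Uniform distribution on d = 10 symbols, samples of size n = 100.
d, n, trials = 10, 100, 20_000
p = np.full(d, 1.0 / d)
true_H = np.log(d)  # entropy of the uniform distribution, in nats

estimates = [plugin_entropy(rng.multinomial(n, p)) for _ in range(trials)]
print("empirical bias:  ", np.mean(estimates) - true_H)
print("-(d - 1) / (2n): ", -(d - 1) / (2 * n))
```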
Since we know that this is the (approximate) finite-sample bias of the maximum likelihood estimator, it follows that \[ E[\hat{H}] + \frac{d - 1}{2 n} \approx H,\] which motivates a new estimator, the Miller-Madow estimator, \[ \hat{H}_{MM} = \hat{H} + \frac{\hat{d} - 1}{2 n}\] which is approximately unbiased. Note that we had to include another estimator, \(\hat{d}\), in our estimator. This is because we don't necessarily know, a priori, the alphabet size, and have to estimate it from the number of distinct symbols we see in the data. Take, for instance, corpus analysis of texts, where we use a 'bag of words' model. What is the number of words in the English language? I doubt we know this number exactly.
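Here's what that looks like in code, as a minimal sketch (the function names are mine): the plug-in estimate plus the correction term, with \(\hat{d}\) taken to be the number of distinct symbols actually observed in the sample.

```python
import numpy as np

def entropy_mle(counts):
    """Plug-in (maximum likelihood) entropy estimate in nats."""
    counts = np.asarray(counts, dtype=float)
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

def entropy_miller_madow(counts):
    """Miller-Madow estimate: plug-in entropy plus the (d_hat - 1) / (2 n) correction."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    d_hat = np.count_nonzero(counts)  # estimate the alphabet size by the symbols we actually saw
    return entropy_mle(counts) + (d_hat - 1) / (2 * n)

# Example: counts for six observed symbols out of an unknown alphabet.
counts = [12, 9, 7, 4, 2, 1]
print(entropy_mle(counts), entropy_miller_madow(counts))
```

Everything the correction needs is already sitting in the count vector, which is what makes the estimator so cheap to tack onto the naive one.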
This estimator is nice, because it's very simple. Our analytical derivation has done all of the work up front, and all we have to do is add on the \(\frac{\hat{d} - 1}{2 n}\) term to our naive estimator, and we're already better off. The Miller-Madow estimator is frequently used as a straw man model, at least in the papers I've read about entropy estimation.
In Part II of this 'series,' we'll look at a different estimator, the bootstrap entropy estimator, that, like all bootstrap estimators, requires computational work on each go. This approach (letting the computer do the heavy lifting) has the advantage that we don't have to think very hard, and turns out to work very well in practice.
One of my undergraduate mathematics professors told us a story about how a colleague of his was paid a large sum of money for photocopying the multidimensional Taylor's theorem. In truth, given the context of the story (financial mathematics), I think he photocopied the result for Ito's Lemma. Wikipedia has certainly ruined the monopoly that mathematicians have over theorems.↩
This is where Paninski gets a little too math-y for my tastes. He claims we need to use a Fréchet derivative here, which generalizes the usual derivative we're used to. To my eyes, \(H\) doesn't require this machinery: it's just a scalar function of \(\mathbf{p}\).↩