Practical Inequalities — Hoeffding's Inequality

A friend had the following problem. His algorithm took as its input a random matrix, performed some operation on that matrix, and got out a result that either did or did not have some property. As the name implies, the random matrix is chosen 'at random.' In particular, each element is an independent and identically distributed normal random variable. There's a whole theory of random matrices, and things can get quite complicated.

Fortunately, for the problem at hand, we didn't need any of that complicated theory. In particular, what we're interested in is the outcome of a certain procedure, and the inputs to that procedure are independent and identically distributed, so the outputs are as well. Since the outcome either happens or does not, the outcomes are Bernoulli random variables with some success probability \(p\).
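
Just to make that reduction concrete, here's a minimal sketch in Python. The matrix size and the property being checked (a lower bound on the smallest singular value) are purely hypothetical stand-ins for my friend's actual procedure; the only point is that each run yields an independent yes/no outcome.

```python
import numpy as np

rng = np.random.default_rng()

def trial(size=50):
    """One run of the procedure: draw a matrix with i.i.d. standard
    normal entries and check a property. The property used here (the
    smallest singular value exceeds 0.01) is a hypothetical stand-in
    for whatever the real procedure checks."""
    A = rng.standard_normal((size, size))
    return np.linalg.svd(A, compute_uv=False).min() > 0.01

# Each call is an independent Bernoulli trial with unknown success probability p.
outcomes = [trial() for _ in range(20)]
print(sum(outcomes) / len(outcomes))  # a (very rough) estimate of p
```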

The question my friend had can be stated as: how many samples do I have to collect to say with some confidence that I have a good guess at the value of the success probability \(p\)? Anyone who has ever taken an intro statistics course knows that we can answer this question by invoking the central limit theorem. But that means we have to worry about asymptotics.

Let's use a nicer result. In particular, Hoeffding's inequality gives us just the tool we need to answer this question. Hoeffding's inequality[1] applies in a very general setting. For Bernoulli random variables, the inequality tells us that if we compute the sample mean \(\hat{p} = \frac{1}{n} \sum\limits_{i = 1}^{n} X_{i}\), then the probability that it deviates from the true success probability \(p\) by more than an amount \(\epsilon\) is

\[ P\left(\left|\hat{p} - p\right| > \epsilon\right) \leq 2 \ \text{exp}(-2 n \epsilon^{2}).\]

Notice how this probability doesn't depend on \(p\) at all. That's really very nice. Also notice that it decays exponentially with \(n\) and the square of \(\epsilon\): we don't have to collect too much data before we expect our approximation to \(p\) to be a good one.
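
Here's a quick Monte Carlo sanity check of the bound. The particular values of \(p\), \(n\), and \(\epsilon\) below are arbitrary choices of mine; for any of them, the empirical deviation frequency should come in below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, reps = 0.3, 200, 0.05, 10_000

# Simulate many experiments of n Bernoulli(p) trials and measure how
# often the sample mean misses p by more than eps.
sample_means = rng.binomial(n, p, size=reps) / n
empirical = np.mean(np.abs(sample_means - p) > eps)
bound = 2 * np.exp(-2 * n * eps**2)

print(f"empirical deviation probability: {empirical:.4f}")
print(f"Hoeffding bound:                 {bound:.4f}")
```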

What we're interested in is getting the right answer, so flipping things around,

\[ P\left(\left|\hat{p} - p\right| \leq \epsilon\right) \geq 1 - 2 \ \text{exp}(-2 n \epsilon^{2}).\]

If we want the probability of landing within \(\epsilon\) of the true value \(p\) to be at least some level \(\alpha\), then we can solve for \(n\) and find

\[ n > - \frac{1}{2} \log\left(\frac{1 - \alpha}{2} \right)\epsilon^{-2}.\]

This gives us the number of times we'd have to run the procedure to be \(100\alpha\)% confident our result is within \(\epsilon\) of the true value[2].
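
If you want to compute this yourself, the rearranged bound is a one-liner. The example numbers (\(\epsilon = 0.01\), \(\alpha = 0.95\)) are mine, not my friend's.

```python
import math

def hoeffding_sample_size(eps, alpha):
    """Smallest n guaranteeing P(|p_hat - p| <= eps) >= alpha,
    from n > -(1/2) * log((1 - alpha) / 2) / eps**2."""
    return math.ceil(-0.5 * math.log((1 - alpha) / 2) / eps**2)

# e.g. to be 95% confident of being within 0.01 of p:
print(hoeffding_sample_size(0.01, 0.95))  # 18445
```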

Of course, what we've basically constructed here is a confidence interval. But most people don't want to be bothered with 'boring'[3] results from statistics like confidence intervals.


  1. Inequalities make up the bread and butter of modern analysis. And with statistics being a branch of probability theory being a branch of measure theory, they also make for great tools in statistical analysis.

  2. This reminded me that Hoeffding's inequality shows up in Probably Approximately Correct (PAC) results in Statistical Learning Theory. Powerful stuff.

  3. Though as we've seen, these results are far from boring, and involve some fun mathematics.