As the Urn Turns: Horseraces and Sampling

Since we're well outside of election season, let's talk about something non-topical: polling results. As I said before, I'm doing a directed reading with a student at Maryland looking at parts of Nate Silver's methodology for election prediction. For the last few weeks of the semester, we're going through his methodology for Senate races. A methodology he calls 'modestly complex.'

One of the first points he makes is that we want to aggregate polls (because of the Law of Large Numbers), but we also want to account for the changing opinions within the US population.

We can set up a simple model for polling: the ball and urn model. Suppose we have an urn with \(N\) balls in it, \(\beta\) of which are blue and \(\rho\) of which are red. (Clearly, \(N = \beta + \rho\).) We sample \(n\) balls from the urn, either with or without replacement, mark down the number of blue balls that we saw, and call this number \(B\). \(B\) is a random variable, and will follow one of two distributions: binomial, if we sample with replacement, or hypergeometric, if we sample without replacement. We then use \(B\) to make a guess at \(\beta\). For example, a good guess at the number of blue balls in the urn is just \(\hat{\beta} = N \frac{B}{n}\).
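
Here's a minimal sketch of a single such poll in R (the numbers below are made up, not from any real election), showing both sampling schemes and the plug-in estimate \(\hat{\beta}\):

```r
N    <- 10000           # balls in the urn (hypothetical)
beta <- 5400            # blue balls (hypothetical)
rho  <- N - beta        # red balls
n    <- 100             # balls sampled in one poll

B_with    <- rbinom(1, size = n, prob = beta / N)   # with replacement: binomial
B_without <- rhyper(1, m = beta, n = rho, k = n)    # without replacement: hypergeometric

beta_hat <- N * B_with / n                          # plug-in estimate of beta
```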

During an election, we don't have just a single poll (or a single draw \(B\) from our urn). Instead, we have several polls1, call these \(B_{1}, \ldots, B_{m}\). With these polls, we can use the Law of Large Numbers to our advantage and guess

\[\hat{\beta}_{m} = N \left(\frac{1}{m} \sum_{i = 1}^{m} \frac{B_{i}}{n_{i}}\right).\]

Notice that the polls don't have to interview the same number of people: for this to work, we don't need the \(B_{i}\) to be identically distributed, just independent (each \(B_{i} / n_{i}\) still has expected value \(\beta / N\), which is all the averaging needs).

But this has all assumed that the urn isn't changing. We haven't added or subtracted any balls from the urn over time, and none of the balls change color. This needn't be the case2. We can account for that by allowing \(\beta_{i}\) to change over time. But now the straight averaging doesn't make as much sense: we decrease the variance in our estimate, but increase the bias, since what we really care about is the outcome on election day. Silver accounts for this by using an exponential moving average: weight the polls proximate in time greater than the polls distal in time.
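
As a rough sketch of that idea (the decay parameter \(\lambda\) below is made up for illustration, not Silver's actual weighting scheme), polls taken closer to election day get exponentially larger weights:

```r
ewma_estimate <- function(p_hat, days_out, lambda = 0.9) {
  w <- lambda^days_out            # polls closer to election day get larger weights
  sum(w * p_hat) / sum(w)
}

# Three hypothetical polls taken 10, 5, and 1 days before the election:
ewma_estimate(p_hat = c(0.50, 0.52, 0.54), days_out = c(10, 5, 1))
```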

But now I want to talk about something a little different. The 'horse race effect.' That is, how 24-hour news networks treat any modest change in the polls as a sign of something. 'So and so is up 5 percentage points today,' and then 'Mr. Blah is down 3 percentage points.' Of course, this is all (mostly) a load of nonsense. You would expect those sorts of fluctuations in the polls just due to the process of sampling. In essence, you're plotting a sequence of independent and identically distributed draws as a time series, and placing significance in the time index that just isn't there. At least not in the short term.

Through the magic of computers, we can simulate a bunch of polls from our two types of urns: the 'stationary' urn and the changing urn. To put some numbers on things, I used the popular vote results from the 2012 presidential elections3. 65,907,213 votes for Obama and 60,931,767 for Romney4. So we have \(N = 126838980\), \(\beta = 65907213\) and \(\rho = 60931767\)5. Gallup polls 1000 people in each of its daily polls, so \(n = 1000\). Given this set of numbers, we could make any probability statement we want about the poll results. But let's simulate, since in the real world, we don't have access to all possible world lines, just our own.
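
Here's a minimal sketch of that simulation (not the exact code behind the plots below): \(n = 1000\) respondents per day for a month, using the binomial approximation since \(N\) is enormous, from a stationary urn and from an urn whose blue proportion drifts from 50% to 54%.

```r
set.seed(2012)
n    <- 1000
days <- 31

p_stationary <- rep(65907213 / 126838980, days)     # fixed proportion of blue balls
p_changing   <- seq(0.50, 0.54, length.out = days)  # drifting proportion of blue balls

poll_stationary <- rbinom(days, size = n, prob = p_stationary) / n
poll_changing   <- rbinom(days, size = n, prob = p_changing) / n

matplot(1:days, cbind(poll_stationary, poll_changing), type = "l", lty = 1,
        xlab = "day", ylab = "proportion blue in the poll")
```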

Here are polling percentages from the two types of urns over a one month period.

Can you guess which one has the changing urn? It's not obvious. I'll add a line showing the proportion of blue balls in the urn over time to guide the eye.

So the one on the top had the trend: over the 31 days, the proportion of blue balls in the urn went from 50% to 54%. That's a pretty big 'swing' in the population. But we see lots of little 'swings' in both urns, and those are a natural part of the variability from only sampling a fraction of the voting population. (1000 people is a lot less than 126838980.) Those 'swings,' the ones that CNN, Fox News, and MSNBC report on, don't tell us anything. In Silver's homage to the signal processing community, they're noise, not signal.

One more thing to say about the plots before I finish, since they make a very nice demonstration of a basic stats idea that many people mangle, namely, confidence intervals6. The 'error bars' in these plots are 95% confidence intervals for the proportion of blue balls in the urn. In fact, I constructed the confidence intervals using Hoeffding's inequality. I imagine most polling agencies construct their confidence intervals using the asymptotic normality of the proportion estimator. (Because in addition to being the obvious way to estimate a proportion from a random sample, it's also the Maximum Likelihood Estimator for the proportion.) For a sample size of 1000, the two methods give you basically the same result.
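
For the curious, here's a sketch of both constructions for a single hypothetical poll of \(n = 1000\); both half-widths come out to a few percentage points.

```r
n     <- 1000
p_hat <- 0.52          # hypothetical observed proportion of blue
alpha <- 0.05

# Hoeffding's inequality: P(|p_hat - p| >= t) <= 2 * exp(-2 * n * t^2)
t_hoeffding <- sqrt(log(2 / alpha) / (2 * n))

# Asymptotic normality of the proportion estimator
t_normal <- qnorm(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)

rbind(hoeffding = p_hat + c(-1, 1) * t_hoeffding,
      normal    = p_hat + c(-1, 1) * t_normal)
```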

Let's see how the National Council on Public Polls tells us we should interpret the confidence intervals:

Thus, for example, a "3 percentage point margin of error" in a national poll means that if the attempt were made to interview every adult in the nation with the same questions in the same way at the same time as the poll was taken, the poll's answers would fall within plus or minus 3 percentage points of the complete count's results 95% of the time.

That's a valid interpretation. But I'll let Larry Wasserman explain a better one, from his All of Statistics:

Some texts interpret confidence intervals as follows: if I repeat the experiment over and over, the interval will contain the parameter 95 percent of the time. This is correct but useless since we rarely repeat the same experiment over and over. A better interpretation is this:

On day 1, you collect data and construct a 95 percent confidence interval for a parameter \(\theta_{1}\). On day 2, you collect new data and construct a 95 percent confidence interval for an unrelated parameter \(\theta_{2}\). On day 3, you collect new data and construct a 95 percent confidence interval for an unrelated parameter \(\theta_{3}\). You continue this way constructing confidence intervals for a sequence of unrelated parameters \(\theta_{1}, \theta_{2}, \theta_{3}, \ldots\) Then 95 percent of your intervals will trap the true parameter value. There is no need to introduce the idea of repeating the same experiment over and over.

Remember, when we're doing frequentist statistics, we care about the frequency properties of the methods we use. A 95 percent confidence interval is a statement about how confident we are in the procedure that we're using, not about our confidence in any single interval that procedure produces. For this particular case, if we look at the second set of plots, our confidence intervals should capture the red lines about 95 percent of the time. We see that this time around, we got lucky and captured the true population value every time. Also notice that the confidence intervals for the changing urn still worked. This is why the interpretation of "[asking] the same questions in the same way at the same time as the poll was taken" is useless. We don't need to be doing that for our confidence intervals to work!

I doubt we'll see a day when newscasters begin the polling segments of their coverage with, "Well, nothing all that significant has happened in the polls today, so instead let's throw it over to this video of kittens." But we can hope.


  1. For example, Gallup does one per day.

  2. Coming from a 'swing state,' I hear a lot about the mythical undecided voter.

  3. Yes, I do know what the electoral college is. And yes, I do realize that the US president isn't determined by the popular vote. But pretending otherwise makes the analysis significantly simpler.

  4. I really have to doubt that they know these numbers down to the single digits. But let's pretend.

  5. I'm also pretending we only have two types of balls in the urn. Because, let's be honest, we live with a two party system.

  6. I would guess that \(P\)-values and confidence intervals are two of the most misunderstood ideas in frequentist statistics.

Why Interpolation Isn't Enough

I wrote this as a response to this post titled Is predictive modeling different from interpolation? Do we really need stats?

My answer to both questions is a resounding yes.


In one word: generalization.

In many more words:

When you say 'interpolation' of the data, I assume you don't mean threading a function through all of the data points, but rather doing some sort of least-squares fitting over intervals (e.g. using splines a la Hastie, Tibshirani, and Friedman's chapter 5).

The goal of statistical modeling is to distinguish between what's there by coincidence (noise) and what's there because of something inherent to the object of study (signal). Usually, we care about the signal, but not the noise. Without statistics, you're modeling noise + signal, and the noise doesn't generalize. If we didn't think there was noise (coincidences) in the data, then you're correct. We might as well throw out our probabilistic model and use approximation theory, maybe applying an approach like Nick Trefethen's chebfun:

http://www.maths.ox.ac.uk/chebfun/

But the noise doesn't generalize, so the interpolated function will get us further from the 'truth,' on average, than using a regression-based method.
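
Here's a toy illustration of that claim (a sketch with a made-up sine signal, not anything from the original post): an interpolating spline chases the noise, while a penalized smoothing spline stays closer to the underlying function.

```r
set.seed(7)
x     <- seq(0, 1, length.out = 50)
truth <- function(z) sin(2 * pi * z)            # the 'signal'
y     <- truth(x) + rnorm(length(x), sd = 0.3)  # signal + noise

interp <- splinefun(x, y)       # interpolation: passes through every data point
smooth <- smooth.spline(x, y)   # regression: penalized least squares

x_new <- seq(0, 1, length.out = 500)
mean((interp(x_new) - truth(x_new))^2)               # typically the larger error
mean((predict(smooth, x_new)$y - truth(x_new))^2)    # typically the smaller error
```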

How to build 'model free' regression methods is an open question. See, for example, section 2.2 from this paper:

http://www.rmm-journal.de/downloads/Article_Wasserman.pdf

But even here, we don't assume that the data are exactly right, i.e. noise-free.

If the world weren't noisy, we could just use interpolation and throw out the stochastic considerations inherent in statistical models. But large parts of the world are noisy.

Excel Statistics

Attention Conservation Notice: Far too many words (again) on why I think (good) statistics matters and (bad) statistics should be destroyed.

I pontificate a lot about statistics. I knew it had gotten bad when a friend commented to me recently that he liked my rant, and I had to ask which one. I think I fight so hard for statistics because there was a period in my life (mostly my undergraduate years) when I thought statistics wasn't a field worth studying. It's just means, medians, and modes. We learned that stuff in elementary school. Why would anyone want to become a statistician?

After beginning to study nonlinear dynamics and complex systems, I saw statistics everywhere. A lot of it was shoddily done. Some of it was not. And then I took mathematical statistics the Spring semester of my senior year, and realized that statistics is not, in fact, a castle built in the sky. Rather, it's a castle solidly constructed, with foundations in measure theory. Lots of smart people worked on making it the field it is today. There are some philosophical differences amongst its practitioners, but overall the field offers a lot of tools for studying the stochastic world we live in.

And yet. And yet I still find myself frequently railing against statistics. Like when I see a course like this. Do I think that sociologists should learn more statistics? Yes. But not like this. No joke, one of the examples from the first day handout proposes doing a \(t\)-test to determine whether married or unmarried individuals have more sex.

And I quote:

People who are married have sex an average of 78 times a year; unmarried people have sex on average 53 times a year. The \(t\)-test value is 7.29 indicating that the difference is statistically significant.

The hypothesis is supported-cannot be rejected-given these measurements.

Let's not worry about whether the estimator they're using is asymptotically \(t\)-distributed. (Do these students even remember what asymptote means?) Much more important is that this question isn't interesting. Statistics got its start answering sociological questions, about life and death, crime and punishment. Poisson proved a generalization of the law of large numbers so that he could use it in designing a fair jury system for courts.

And now sociologists in training run \(t\)-tests to see who has more sex.

During an office conversation with a friend, we coined a new phrase: 'Excel statistics.' That's the sort of statistics this handout advocates. 'Let's do a hypothesis test, because that involves math! Which makes it science!'1 No. 'Let's fit a line through the data, and report an \(R^{2}\) value!' No. 'Let's...'

Readily available computing has brought statistics to the masses. Which should be a good thing. Unfortunately, the masses aren't ready for statistics.

Maximum Likelihood and Entropy

Cosma Shalizi posted recently about optimization for learning. This is a recurring theme in statistics1: set up a functional combining empirical risk and a regularization term for smoothing, then use optimization to find a parameter value that minimizes this functional. Standard examples include ridge regression, LASSO, smoothing splines, support vector machines, and many, many more. As usual, Shalizi does a masterful job of introducing the key ideas behind this approach, even presenting a simple maximum likelihood example (for the Pareto distribution), along with R code.

After reading through his post and working through the Pareto distribution example, I realized two things: (i) I'd never really understood why maximum likelihood estimation works and (ii) I'd never really realized the important connection between log likelihoods and information theoretic quantities like entropy and Kullback-Leibler divergence. The first part is understandable: I've had this experience many, many times. The second part is embarrassing: both involve logarithms of mass / density functions, so the connection should have been obvious.

But even after making the second realization (the presence of logs in both subjects), I couldn't find any material explicitly making this connection. The closest thing a cursory googling turned up was this. But it didn't quite answer my question.

So, along the lines of my other post on entropy, let's follow the connection between maximum likelihood estimation and information theory to its completion. Consider this as a supplement to Shalizi's post, focusing explicitly on maximum likelihood estimation.

First, what is maximum likelihood estimation? Let's consider a random sample \(X_{1}, \ldots, X_{n}\) of random variables, all with common density2 \(f(x; \theta_{0})\). By specifying \(\theta_{0}\) like this, I'm assuming we're doing parametric inference: fixing a (possibly vector) parameter \(\theta_{0}\) and the probability family (like normal, exponential, Pareto, etc.) gives us the density \(f(x; \theta_{0})\) with which we can make any probability statement of interest. Of course, since we're doing statistical inference, \(\theta_{0}\) is unknown. Our goal is to figure it out: we formulate a good statistic of the data \(X_{1}, \ldots, X_{n}\), call it \(T\), and estimate our parameter as \(\hat{\theta} = T(X_{1}, X_{2}, \ldots, X_{n})\). A statistic that you're familiar with is the sample mean of a random sample,

\[T(X_{1}, \ldots, X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} X_{i}.\]

The maximum likelihood method for determining \(\hat{\theta}\) is as follows. Write down the joint density of the random sample, \[\begin{align*} f_{X_{1}, \ldots, X_{n}}(x_{1}, \ldots, x_{n}; \theta) &= f_{X_{1}}(x_{1}; \theta) \ldots f_{X_{n}}(x_{n}; \theta)\\ &= \prod_{i = 1}^{n} f_{X_{i}}(x_{i}; \theta) \\ &= \prod_{i = 1}^{n} f_{X}(x_{i}; \theta) \end{align*}\] where the factorization of the joint density comes from independence and the last line follows from the sample being identically distributed. Notice that we're not fixing \(\theta\). Remember: we don't know \(\theta\). So we can think of this joint density as a function of \(\theta\), as well as a function of the arguments \((x_{1}, \ldots, x_{n}).\) When we think of the density from this perspective, we call it the likelihood, and would usually write \[L(\theta; x_{1}^{n}) = \prod_{i = 1}^{n} f_{X}(x_{i}; \theta)\] where \(x_{1}^{n}\) is shorthand for the vector \((x_{1}, \ldots, x_{n})\). We then evaluate the likelihood at our random sample, which gives us \[L(\theta; X_{1}^{n})\] and choose the \(\theta\) value that maximizes it. That is, we take our estimator to be \[ \hat{\theta} = \arg \max_{\theta} L(\theta; X_{1}^{n}).\] Of course, \(X_{1}^{n}\) is random, and thus \(L(\theta; X_{1}^{n})\) is a random function, making \(\hat{\theta}\) a random variable. Sometimes we can find an explicit function mapping us from the sample to the statistic, in which case we could write \(\hat{\theta} = T(X_{1}, X_{2}, \ldots, X_{n}).\) If not, we proceed by numerical optimization.
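
As a concrete sketch of the numerical route, here's a toy version of the Pareto case with a known minimum \(x_{\min}\) (my own illustration, not Shalizi's code):

```r
set.seed(1)
x_min  <- 1
alpha0 <- 2.5                                 # the 'true' parameter theta_0
n      <- 1000
x      <- x_min * runif(n)^(-1 / alpha0)      # inverse-CDF draws from a Pareto

# Negative log-likelihood: -sum_i log f(x_i; alpha),
# with f(x; alpha) = alpha * x_min^alpha / x^(alpha + 1)
neg_loglik <- function(alpha) {
  -sum(log(alpha) + alpha * log(x_min) - (alpha + 1) * log(x))
}

fit <- optimize(neg_loglik, interval = c(0.1, 10))

# Compare the numerical optimum to the closed-form MLE, n / sum(log(x / x_min))
c(numerical = fit$minimum, closed_form = n / sum(log(x / x_min)))
```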

In words, we choose the parameter value that maximizes the likelihood of the data. This makes intuitive sense, but why does it work?

In practice, we minimize the negative log-likelihood. That is, we seek \[ \hat{\theta} = \arg \min_{\theta} - \log L(\theta; X_{1}^{n}).\] All the times I've seen this presented, the usual reason given is that this approach turns products into sums, which makes the minimization easier. This is true, but there's a deeper reason for doing this. The logarithm puts us into the domain of information theory, which we can use to show that maximum likelihood makes sense3.

Let's pull the logarithm through the product, giving \[- \log L(\theta; X_{1}^{n}) = - \sum_{i = 1}^{n} \log f(X_{i}; \theta).\] Finally, if we divide by \(n\), giving us the sample negative log-likelihood, we get \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta).\] I got stuck here for a while. By the (Strong or Weak) Law of Large Numbers, this should converge to its expected value (almost surely or in quadratic mean). More specifically, by the asymptotic equipartition theorem, I thought this should converge to the differential entropy of \(X\), namely \[h[f(x; \theta_{0})] = - E[\log f(X; \theta_{0})] = - \int_{\mathbb{R}} f_{X}(x; \theta_{0}) \log f_{X}(x; \theta_{0}) \, dx\] But we have to be careful about how we take the expectation. We have \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta).\] which doesn't fix \(\theta\) at \(\theta_{0}\), and the asymptotic equipartition theorem tells us \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta_{0}) \to h[f(x; \theta_{0})],\] but nothing about how \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta)\] should behave. To get to where we're going, we have to pull the usual mathematical trick of adding zero. If we pull the negative inside and then add and subtract the log-likelihood under the true model, we get \[\begin{align} \frac{1}{n} \sum_{i = 1}^{n} \left\{- \log f(X_{i}; \theta) + \log f(X_{i}; \theta_{0}) - \log f(X_{i}; \theta_{0})\right\} &= \frac{1}{n}\sum_{i = 1}^{n} \left\{ \log \frac{f(X_{i}; \theta_{0})}{f(X_{i}; \theta)} - \log f(X_{i}; \theta_{0}) \right \} \\ &= \frac{1}{n}\sum_{i = 1}^{n} \log \frac{f(X_{i}; \theta_{0})}{f(X_{i}; \theta)} - \frac{1}{n}\sum_{i = 1}^{n} \log f(X_{i}; \theta_{0}) \end{align}\] Now this converges (again, either almost surely or in quadratic mean) to \[ D_{KL}(f(x; \theta_{0}) || f(x; \theta)) + h[f(x; \theta_{0})],\] where \[ D_{KL}(f(x; \theta_{0}) || f(x; \theta)) = \int_{\mathbb{R}} f(x; \theta_{0}) \log \frac{f(x; \theta_{0})}{f(x; \theta)} \, dx\] is the Kullback-Leibler divergence or relative entropy between \(f(x; \theta_{0})\) and \(f(x; \theta)\). Thus, we see that the mean negative log-likelihood converges to the differential entropy under the true distribution plus the Kullback-Leibler divergence between the true distribution and the distribution we guess at. One can show4 that the Kullback-Leibler divergence is non-negative and is zero only when \(f(x; \theta_{0}) = f(x; \theta)\) almost surely. Recalling that we're seeking to minimize the mean negative log-likelihood, this means that we want to choose \(\theta = \theta_{0}\) to minimize this limiting function.
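
A quick numerical sanity check of this limit, using an exponential family where both terms have closed forms (my choice for illustration, not the Pareto example above):

```r
set.seed(3)
rate0 <- 2                        # the true parameter theta_0
n     <- 1e6
x     <- rexp(n, rate = rate0)

rate     <- 3                     # a candidate theta != theta_0
mean_nll <- -mean(dexp(x, rate = rate, log = TRUE))   # mean negative log-likelihood

h_true <- 1 - log(rate0)                              # differential entropy (in nats)
kl     <- log(rate0 / rate) + rate / rate0 - 1        # closed-form KL divergence
c(empirical = mean_nll, limit = h_true + kl)          # these should nearly agree
```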

And after a bit of a detour through information theory, we've seen a sketch as to why Maximum Likelihood Estimation makes sense as a procedure for estimating a parameter \(\theta\). The mean negative log-likelihood converges to a non-random function, and that non-random function takes its minimum at the correct answer to our question. Fully proving consistency of the maximum likelihood estimator requires a good deal more work.

But that's the beauty of sketches.


  1. And machine learning. But a lot of people always overlook statistics when talking about learning, so I'll show my prejudice.

  2. I'm technically going to be talking about differential entropy throughout, since the Pareto distribution from Shalizi is a continuous distribution. But the same results hold for discrete distributions and thus discrete (aka Shannon) entropy.

  3. I'm going to be super hand-wavy here. Less hand-wavy than an undergraduate statistics class, for sure, but too hand-wavy for a graduate asymptotic statistics class. My apologies.

  4. And I won't. The proof involves an application of Jensen's inequality.

Lines, Springs, and Statistical Thinking

Attention Conservation Notice: A lesson that you can use any statistical method you want, but some are certainly better than others.

I follow Rhett Allain's blog Dot Physics. In fact, his most recent post about replacing the terms 'hypothesis', 'theory', and 'scientific law' with 'model' in the scientific lexicon is spot on. That's something I feel extremely comfortable with as a mathematician: 'most models are wrong, some are useful.'

But another recent post of his, about finding the spring constant for a spring-mass system, made my ears perk up. In high school and college, my favorite part of labs was always the data analysis (and my least favorite part was the labwork, which goes a long way towards explaining why I'm now training to be a mathematician / statistician).

With the briefest of introductions, the model for a spring-mass system (without drag) falls out of the linear, second-order, constant coefficient ordinary differential equation1

\[m y'' + ky = 0\]

where \(y\) is the displacement of the spring away from its equilibrium position, \(m\) is the mass on the spring, and \(k\) is the spring constant. This follows directly from Hooke's law2, namely that the force acting on the mass is proportional to the negative of the displacement away from the equilibrium position, with the proportionality tied up in this thing we call the spring constant \(k\). From this alone we can derive the equation that we're interested in, namely the relationship between the period of a spring's oscillation and the spring constant. Skipping over those details, we get that

\[T = \frac{2 \pi}{\sqrt{k}} \sqrt{m}.\]

The usual lab that any self-respecting student of the sciences has performed is to place various different masses on the spring, displace the mass a bit from its equilibrium position, and measure its period (hopefully over a few cycles). Once you have a few values \((m_{1}, T_{1}), \ldots, (m_{n}, T_{n})\), you can then go about trying to find the spring constant. Doing so is of course a matter of regression, which brings us into the wheelhouse of statistics.

And since we're in statistics land, let's write down a more realistic model. In particular, the thing that is being measured (and by undergraduates, even!) is the period as a function of the mass. Undergraduates are a fickle lot, with better things to think about than verifying a 350+ year old law. Their attention may waver. Their reaction speed may be lacking. Not to mention physical variations in the measurement system. All of this to say that the model we really3 have is

\[T_{i} = \frac{2 \pi}{\sqrt{k}} \sqrt{m_{i}} + \epsilon_{i}\]

where \(\epsilon_{i}\) is an additive error which, for the sake of simplicity, we'll assume has mean zero and variance \(\sigma^{2}\) not depending on \(m\). We'll also assume the errors are uncorrelated. This isn't so unreasonable of a model, compared to some of the probabilistic models out there.

As written, it seems obvious what we should do4. Namely, the model for \(T\) is linear in \(\sqrt{m}\), and we know the values of the \(m_{i}\) (to pretty high precision, if I recall correctly from my college physics days). On top of that, the Gauss-Markov theorem tells us linear regression is the best we can do to find the spring constant (out of the class of all linear unbiased estimators for it). So we should fit a linear regression of \(T\) against \(\sqrt{m}\), and read off \(k\) by simple algebra.

Allain takes a different tack, which at first seems just as good. Namely, he notices that if we square both sides of

\[T = \frac{2 \pi}{\sqrt{k}} \sqrt{m},\]

we get

\[T^{2} = \frac{(2 \pi)^{2}}{k} m,\]

which is now linear in \(T^{2}\) and \(m\). We can fit a line through the points \((m_{i}, T_{i}^{2})\) just as we would have with my method, and read off \(k\) in a similar fashion. But the problem with this approach is that the model isn't right. Remember, the actual model has error in it,

\[T_{i} = \frac{2 \pi}{\sqrt{k}} \sqrt{m_{i}} + \epsilon_{i},\]

so when we square the period, we get

\[T_{i}^{2} = \frac{(2 \pi)^2}{k} m_{i} + 2 \frac{2 \pi}{\sqrt{k}} \sqrt{m_{i}} \epsilon_{i} + \epsilon^{2}_{i}.\]

This is much uglier. And adds a bias to our model. In particular, note that

\[E[T_{i}^{2}] = \frac{(2 \pi)^2}{k} m_{i} + E[\epsilon^{2}_{i}] = \frac{(2 \pi)^2}{k} m_{i} + \sigma^{2}.\]

So as we add more measurement noise, we'll get more and more biased from the true value of the spring constant5.

This is a lot of math to say what every experimentalist already knows: when dealing with propagation of errors, it's best to manipulate the thing that has the least amount of error in it. In this case, that means taking the square root of the mass instead of squaring the period.

Theory is nice, but let's follow Allain's lead and do some simulations. I'll take the true value of the spring constant to be \(k = 13.5\), as Allain does in his post. I simulate from the model

\[T_{i} = \frac{2 \pi}{\sqrt{k}} \sqrt{m_{i}} + \epsilon_{i},\]

for each value of \(m_{i}\), and do this \(B = 10000\) times6,7. I took \(\epsilon_{i} \sim N(0, \sigma^{2})\), with various different values for \(\sigma\). Allain initially took his \(\epsilon_{i}\) to be Uniform(-1/20, 1/20), so I used that to get an equivalent variance for the normal case, namely \(\sigma = \sqrt{1/1200} \approx 0.03\) (the variance of a Uniform(-1/20, 1/20) random variable is \(1/1200\)). Below I've plotted both what the noisy data looks like (with the true regression curve in red), and the distribution of our different estimators over the 10000 trials.
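
Here's a stripped-down sketch of that simulation (a toy version, not the posted code), comparing the \(\sqrt{m}\) regression to the squared-period regression, both without intercepts:

```r
set.seed(42)
k_true <- 13.5
m      <- seq(0.2, 0.3, by = 0.01)    # the masses ('design points')
sigma  <- 0.1
B      <- 10000

k_sqrt <- numeric(B)                  # estimates from T ~ sqrt(m), no intercept
k_sq   <- numeric(B)                  # estimates from T^2 ~ m, no intercept

for (b in 1:B) {
  T_obs <- 2 * pi / sqrt(k_true) * sqrt(m) + rnorm(length(m), sd = sigma)

  # T = (2 pi / sqrt(k)) sqrt(m): slope = 2 pi / sqrt(k), so k = (2 pi / slope)^2
  slope1    <- coef(lm(T_obs ~ sqrt(m) - 1))[1]
  k_sqrt[b] <- (2 * pi / slope1)^2

  # T^2 = ((2 pi)^2 / k) m: slope = (2 pi)^2 / k, so k = (2 pi)^2 / slope
  slope2  <- coef(lm(I(T_obs^2) ~ m - 1))[1]
  k_sq[b] <- (2 * pi)^2 / slope2
}

c(mean_k_from_sqrt_m = mean(k_sqrt), mean_k_from_T_squared = mean(k_sq))
```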

In the presence of low noise, all but the method that estimates both an intercept and a slope do about equally well8. This is good to know. But we would like our methods to continue to work well in the case of large amounts of noise. I can't stress this enough: these experiments are being performed by undergraduates!

So, let's bump up the noise to \(\sigma = 0.1\). This means that 95% of the time, the noise will push us away from the true period by at most about 0.2 seconds. That doesn't seem like an unreasonable amount of uncertainty to allow in our period recording procedure. And here's the result:

We can already see the bias creeping into our \(T^2\)-based estimators. Only the regression method using \(\sqrt{m}\) gives a density sharply peaked around the true value. This is unsurprising, given our analysis above, but also comforting. Finally, let's take the noise to an even greater extreme and use \(\sigma = 0.5\). This means that 95% of the time, we'll be off by at most a second from the true value of the period. This is an unreasonable amount of noise (unless the undergraduate performing the lab has had a particularly bad day), but it demonstrates the strengths and weaknesses of the various estimators for \(k\).

At this point, the densities need no explanation. Of the methods proposed in Allain's post, only the '\(m +\) No Intercept' method still stays in the right ballpark of the true value of \(k\), but it is biased by the squaring procedure. I can't stress enough that this amount of noise is unreasonable, especially since the students are presumably measuring the period over several oscillations and thus have the Law of Large Numbers on their side. But it does demonstrate the importance of thinking carefully (and statistically!) about how we would like to do our data analysis.

I have fond memories from my undergraduate days of manipulating equations until they were linear in the appropriate variables, and then fitting a line through them to extract some sort of physical quantity. At the time of that training, I had no statistical sophistication whatsoever. Understandably so. There's barely enough time to cover the science in standard science courses. But it would be nice (for both theorists- and experimentalists-in-training) if a little more statistical thinking made it into the curriculum.

For the sake of reproducibility, I have made my R code available here.


  1. Can you tell I've TAed Elementary Differential Equations twice at UMD?

  2. To have lived when you could have such a simple law named after you!

  3. Not in the ontological sense, but rather in the sense of being more useful.

  4. Though 'obvious' is a loaded word. I've been spending too much time thinking about regression, in all its various forms, lately.

  5. Unless we add an (artificial) intercept to our regression model. The nice thing about (simple) physical models: we have reasons to include / not include parameters in our regression.

  6. Simulation is a wonderful thing. And we've come a long way since the days when statisticians had to flip coins to get statistical fodder. I read a story once about Eugen Slutsky's study of apparent cycles in economic data. In the process of performing his simulations, he really did flip a coin thousands of times to simulate a random walk. Unfortunately, I can't find the article where I originally read this, so I'll just state it as an anecdote and hope I'm not misremembering.

  7. You'll also notice that I took my masses (called 'design points' in the jargon of regression) to be at more regularly spaced intervals. Namely, I took them to be from 0.2 to 0.3, with spacing 0.01. This seems like a reasonable enough choice of masses, though again, asking our undergraduate to measure 21 periods may be a bit much. Besides, we certainly won't do better with fewer design points than this.

  8. The degradation in performance for the '\(m +\) Intercept' model can be explained in terms of the well-known bias-variance tradeoff. By adding the intercept term to our squared period model, we've allowed the slope to be estimated correctly (by capturing the \(\sigma^{2}\) term within the intercept), thus reducing the bias. But by forcing our procedure to estimate two quantities, we've increased the variance in both. In other words: if physics tells us to use a simpler model, for heaven's sake, use the simpler model!

Good Statistics

When it comes to meta-models of statistics, here are two philosophies that I respect:

  1. (My) Bayesian approach, which I associate with E. T. Jaynes, in which you construct models with strong assumptions, ride your models hard, check their fit to data, and then scrap them and improve them as necessary.
  2. At the other extreme, model-free statistical procedures that are designed to work well under very weak assumptions-for example, instead of assuming a distribution is Gaussian, you would just want the procedure to work well under some conditions on the smoothness of the second derivative of the log density function.

Both the above philosophies recognize that (almost) all important assumptions will be wrong, and they resolve this concern via aggressive model checking or via robustness. And of course there are intermediate positions, such as working with Bayesian models that have been shown to be robust, and then still checking them. Or, to flip it around, using robust methods and checking their implicit assumptions.

See. Frequentists can do good statistics, too.

If it's not obvious, I prefer the second approach.

Why Entropy?

There are many ways to define the entropy of a random variable. Claude Shannon originally justified his choice on axiomatic grounds. If we force our functional \(H[p(x)]\) to satisfy three properties, then the form of entropy as

\[H[p(x)] = - E[\log_{2}(p(X))]\]

falls out of the requirements on \(H[p(x)]\).

A less formal, but equivalent, definition of entropy is in terms of the minimum expected number of binary questions needed to identify the value of a discrete random variable \(X\). This is (from personal experience) a typical starting point for introducing entropy. Let's take this approach, but then follow it to completion1.

We'll start with the example from Cover and Thomas. Let \(X\) be a discrete random variable taking values from the 'alphabet'2 \(\mathcal{X} = \{ a, b, c, d\}\). Thus, we can specify \(X\) by giving its probability mass function,

\[p(x) = P(X = x) = \left\{ \begin{array}{cc} \frac{1}{2} &: x = a \\ \frac{1}{4} &: x = b \\ \frac{1}{8} &: x = c \\ \frac{1}{8} &: x = d \end{array}\right. .\]

Our goal is to ask a series of binary questions of the form 'Is \(X = x\)?' and get the answer 'Yes!' as quickly as possible. Of course, \(X\) is a random variable, so the number of questions we will have to ask isn't a number, but a random variable itself. Thus, we can formalize our goal as follows: we want to minimize the expected number of questions we'll have to ask to get at the identity of \(X\). Call this quantity \(\bar{\ell}\) (the reason for this choice will become clear later on). We can work out this value for the probability mass function above. But first, we have to specify the procedure that we'll use to ask the questions. The obvious choice, and the one we'll make, is to first ask whether \(X = a\) (since \(a\) is the most probable value), then \(b\), then \(c\), and so on, until we've arrived at the correct value for \(X\). Given this strategy, we'll ask a single question \(1/2\) of the time (since \(P(X = a) = 1/2\)), two questions 1/4 of the time, and three questions 1/8 + 1/8 = 1/4 of the time3. Thus, the expected number of questions we'll have to ask is

\[ \bar{\ell} = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75.\]

A quick computation shows that 1.75 is also the entropy of \(X\), since4

\[ \begin{align} H[X] &= -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ &= -\left(\frac{1}{2} \cdot \log \frac{1}{2} + \frac{1}{4} \cdot \log \frac{1}{4} + \frac{1}{8} \cdot \log \frac{1}{8} + \frac{1}{8} \cdot \log \frac{1}{8}\right) \\ &= 1.75. \end{align}\]

Thus, for this case, the entropy and the minimum number of expected questions turn out to coincide. This only occurs when \(p(x)\) is a dyadic distribution5. In the general case where \(p(x)\) is not dyadic, the minimum expected number of questions is bounded between entropy and entropy plus 1. That is, in general,

\[ H[X] \leq \bar{\ell}^{*} < H[X] + 1.\]

Why this should be the case will (hopefully) become more clear in a moment.
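
Before formalizing things, here's the example above as a two-line computation:

```r
p         <- c(a = 1/2, b = 1/4, c = 1/8, d = 1/8)
questions <- c(a = 1,   b = 2,   c = 3,   d = 3)   # questions asked for each outcome

sum(p * questions)   # expected number of questions: 1.75
-sum(p * log2(p))    # entropy of X: also 1.75, since p is dyadic
```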

Okay, so far we haven't done anything beyond the standard approach taken in Cover and Thomas. But now let's formalize our question-asking process a little, and see how we can relate it to data compression. The goal of lossless data compression is to map an alphabet \(\mathcal{X}\) into a new alphabet \(\mathcal{D}\) (which we'll take to be binary, namely \(\mathcal{D} = \{ 0, 1\}\)) so as to make the encoded words as short as possible, in expectation. There are a lot of results in data compression. The main result we'll need is the optimality of Huffman codes. I won't go into the details of what a Huffman code is, but suffice it to say this code will get us the question procedure that gives us the minimum expected number of binary questions.

Here is how we will use the Huffman code to encode our alphabet \(\mathcal{X} = \{a, b, c, d\}\). We'll map each element of \(\mathcal{X}\) into a binary string which keeps track of the number of questions we asked as well as the answers to those questions. We'll use the same questioning procedure as before, which will give us the mapping

\[\begin{align} a &\to 1 \\ b &\to 01 \\ c &\to 001 \\ d &\to 000. \end{align}\]

Again, we'll read each binary string from left to right. Thus, the string that \(a\) maps to, namely 1, says, "We've asked 'Is \(X = a\)?' and gotten the answer 'Yes.'" The string that \(b\) maps to says, "We've asked 'Is \(X = a\)?' and gotten the answer 'No.' We then asked 'Is \(X = b\)?' and gotten the answer 'Yes.'" To go one further, the string that \(d\) maps to says, "We've asked 'Is \(X = a\)?' and gotten the answer 'No.' We then asked 'Is \(X = b\)?' and gotten the answer 'No.' We then asked 'Is \(X = c\)?' and gotten the answer 'No.'" (Thus, \(X = d\), by exhaustion of the possibilities.) Note that this code is not unique (for example, we could have asked about \(d\) before asking about \(c\), or interchanged the roles of 0 and 1), but it is optimal.

So now we have a way to formulate our question procedure in an optimal way, and we see that the length of each codeword corresponds to the number of questions we had to ask to get at \(X\)'s identity. Therefore, the expected length of a codeword is equal to the expected number of binary questions, and we have made the missing link between data compression and binary questions.

We still haven't shown that the optimal questioning strategy results in an expected number of binary questions between entropy and entropy plus 1. To get this result, we can now turn again to the ample literature on coding theory, and realize that the Huffman code, which we used to construct our procedure, is a prefix code. One of the (many) nice things about prefix codes is that they satisfy Kraft's inequality. Namely, if for each \(x_{i} \in \mathcal{X}\) we specify a code length \(\ell_{i}\), which is the number of digits that \(x_{i}\) maps to6, and the code is a prefix code, then Kraft's inequality requires that

\[ \sum_{i = 1}^{|\mathcal{X}|} D^{-\ell_{i}} \leq 1,\]

where we have assumed a \(D\)-ary encoding. (For a binary encoding, we took \(D = 2\).) This restriction, by itself, allows us to demonstrate that the expected code length (which, remember, is in a one-to-one mapping with the expected number of binary questions) can be no less than the entropy7. To show that the expected code length of the optimal code is bounded above by entropy plus 1, we construct a simpler, but sub-optimal code, called the Shannon-Fano code, whose expected code length is easily shown to be bounded above by entropy plus 1. Since our Huffman code is optimal, and thus at least as good as the Shannon-Fano code, we get that overall

\[ H[X] \leq \bar{\ell}^{*} < H[X] + 1\]

where \(\bar{\ell}^{*}\) is the expected code length (expected number of binary questions) for the optimal code (optimal question procedure).
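
And for concreteness, a one-line check of Kraft's inequality for the binary code above (codeword lengths 1, 2, 3, 3, so \(D = 2\)):

```r
sum(2^(-c(1, 2, 3, 3)))   # equals 1, so Kraft's inequality holds (with equality)
```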


  1. Following the idea to its natural, and final, conclusion is not something that seems typical. In Cover and Thomas's Elements of Information Theory, for example, the definition of entropy in terms of the minimum expected number of binary questions is given, but then not justified beyond a single example. They promise a justification in Chapter 5 of the text, which covers Data Compression. But as far as I can tell (and I haven't made a careful reading of that chapter), they do not then make the key connection between data compression and entropy that they allude to in Chapter 1. Admittedly, the connection is obvious (if you think about it long enough), but I still think it's worth stating explicitly. A quick Google search found no mentions of this explicit connection, so maybe this post can fill that void.

  2. Information theory is overflowing with these electrical engineering-specific words: alphabet, code word, etc. As someone coming to the topic from a probabilistic / statistical background, this can be off-putting at times. But I have to remind myself that the electrical engineers made the first go at information theory, so they have earned the right to name things as they please.

  3. Note that we only ever have to ask at most 3 questions. With \(n\) possible values, we'll always have to ask at most \(n - 1\) questions, since on the last question, we know the identity of \(X\) even if the answer is no (we don't need an \(n^\text{th}\) question to identify \(X\)).

  4. From here on out, all logarithms are taken base 2.

  5. Or to be more exact, when the distribution is \(D\)-adic, where we take our encoded alphabet to be \(\mathcal{D} = \{ 0, 1, \ldots, D- 1\}\). The dyadic case is a special case where \(D = 2.\)

  6. In the previous example, the length of the codeword \(a\) maps to is \(1\), so \(\ell_{a} = 1\). Similarly, \(\ell_{b} = 2, \ell_{c} = 3\), and \(\ell_{d} = 3\).

  7. The proof is slick, and only involves a straightforward normalization and a simple tool from Information Theory. Of course, as with all slick proofs, it requires a trick that only seems obvious after the fact.

Mind the Gap

Larry Wasserman recently had a post about the gap between the development of a statistical method, and the justification of that method. He gives timelines for various methods, from the standard (like Maximum Likelihood Estimation) to the now-becoming-standard (like Causal Inference), demonstrating the large amount of time that had to pass between when inventors propose a method and when researchers (sometimes the same as the inventors!) prove results about why that method 'makes sense.'

I suppose, for the non-mathematically inclined reader, that I have to explain a little what I mean by 'makes sense.' Any method can be proposed to solve a problem (statistical or otherwise). But for statistical problems, we (sometimes) have the tools to answer whether or not we'll get the right answer, most of the time, with a given method.

As an illustrative example, consider the goal of determining the exponent of a power law model for the distribution of some empirical phenomena. This is a favorite topic of a lot of complex system scientists, especially physicists. For those with no training in statistics (beyond what they get in their undergraduate lab courses), a logical approach might be computing the empirical distribution function of the data (or rather its complement, the survival function), taking its logarithm, and fitting a line through it against the logarithm of the data1. If the data are power law distributed, then the true2 survival function of the data will itself be a power law, and thus its logarithm will be linear in the log of the data. Taking the slope of the line of best fit would give an affine transformation of the exponent, and our job is done.

Except that we don't have the true cumulative distribution function. Instead, we have the empirical distribution function, which is itself a stochastic process with some underlying distribution. In the limit of infinite data, the empirical distribution will converge to the true distribution function almost surely. But in the meantime, we don't have many guarantees about it, and we certainly don't know that it will give us a good estimate of the exponent of the power law.

Instead, we should use Maximum Likelihood Estimation, as proposed by Ronald Fisher (and perhaps first used by Carl Friedrich Gauss in some computations about celestial orbits). This method is standard statistical fare3, and its application to power laws is well documented. And unlike the line-of-best-fit-through-the-empirical-distribution-function method, it has provably nice properties. And yet people still continue to use the line-of-best-fit method. Why? Because it makes intuitive sense. But it doesn't 'make sense,' statistically.
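
To make the contrast concrete, here's a small sketch (assuming a pure power law above a known \(x_{\min}\), and using the continuous-case MLE from Clauset, Shalizi, and Newman) comparing the MLE to the slope of a line fit through the log empirical survival function:

```r
set.seed(11)
alpha0 <- 2.5                                    # density exponent: f(x) ~ x^(-alpha0)
x_min  <- 1
n      <- 5000
x      <- x_min * runif(n)^(-1 / (alpha0 - 1))   # inverse-CDF power law draws

# Maximum likelihood (continuous case, known x_min):
alpha_mle <- 1 + n / sum(log(x / x_min))

# Line of best fit through the log-log empirical survival function:
x_sorted  <- sort(x)
surv      <- 1 - (seq_len(n) - 1) / n            # empirical P(X >= x_sorted[i])
fit       <- lm(log(surv) ~ log(x_sorted))
alpha_fit <- 1 - unname(coef(fit)[2])            # survival-function slope is -(alpha - 1)

c(mle = alpha_mle, line_fit = alpha_fit)
```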

All of that to explain the (research) nightmares that keep me up at night: using (or worse, proposing!) a method that doesn't 'make sense.' Which is why Larry Wasserman's post is so consoling. Perhaps I don't immediately know why a method I would like to use makes sense. But hopefully in the future, someone far smarter than me will derive the justification for my method.

To quote Cosma Shalizi, from Section 11.4.3 of his thesis, where he proposes using computational mechanics for continuous-valued, discrete-time stochastic processes,

It becomes a problem of functional analysis, so the mathematical foundations, if we aim at keeping to even the present standard of rigor, will get much more complicated. Still, we might invoke the physicist's license to ignore foundations, on the grounds that if it works, the mathematicians will find a way of making it right.

For those interested, he has since made this leap. See his and his graduate student Georg Goerg's work on LICORS.

Wasserman's post was prescient. Rob Tibshirani, one of the co-inventors of the LASSO (an \(\ell_{1}\)-regularized least squares method), recently developed a significance test for the LASSO (one of the methods listed in Wasserman's 'The Gap' post). This provides even more (statistically justified!) tools for using the LASSO.

Which means that the method now 'makes sense' even more.


  1. Admittedly, this is the favorite approach of a lot of people, not just physicists. But for a group supposedly interested in nonlinear science, the use of a linear method seems suspect.

  2. Okay, 'true' here may be a bit too strong of a word. Let's say 'a good model of the data generating process.'

  3. I learned about it in my undergraduate mathematical statistics course.

Silver at the Oscars

As part of the Directed Reading Program I'm doing with a student at Maryland, we've decided to dig into Nate Silver's Oscar Predictions for this year.

'Fortunately' for us, Silver only gives us a glimpse of his methodology. He also doesn't supply us with the data he used. I've compiled that data here. In short, here is his approach:

The gory details: I weight each award based on the square of its historical success rate, and then double the score for awards whose voting memberships overlap significantly with the Academy.

I don't know how a weighted average counts as 'gory,' but I guess Silver has a different audience in mind. Let's unpack this methodology, in the hopes that "by relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems."1

Let \(O^{(j)}\) be the Oscar winner in the \(j^{\text{th}}\) year, \(j = 1, \ldots, 25\). Let \(C_{i}^{(j)}\) be the award winner from the \(i^{\text{th}}\) ceremony, \(i = 1, \ldots, 16\)2. Let \(S^{(j)}\) be the set of candidate Oscar winners on the \(j^{\text{th}}\) year. For example, this year we had \[S^{(26)} = \{\text{Amour, Argo, Beasts of the Southern Wild, Django Unchained, Les Miserables, Life of Pi, Lincoln, Silver Linings Playbook, Zero Dark Thirty} \}.\] Thus, we have nine films to predict from, and \(|S^{(26)}| = 9.\) From the description above, Silver wants to construct a function from \(S^{(26)}\) to the reals such that

\[ f(s) = \sum_{i = 1}^{16} \beta_{i} I[C_{i}^{(26)} = s], s \in S^{(26)}.\]

Here \(I[\cdot]\) is the indicator function, which is 1 if its argument is true, and 0 otherwise. For example, if \(s\) = Argo and \(C_{1}^{(26)} =\) Argo, then \(I[C_{1}^{(26)} = s] = 1.\) But if \(s\) = Argo and \(C_{2}^{(26)} =\) Amour, then \(I[C_{2}^{(26)} = s] = 0\). We then choose the film \(s\) that gets the highest weighted vote,

\[ \hat{s} = \arg \max_{s \in S^{(26)}} f(s).\]

Let's consider a few choices of \(\beta_{i}\) and see what sort of prediction they give us. If we take \(\beta_{i} = 1\) for all \(i\), our predictor reduces to

\[f(s) = \sum_{i = 1}^{16} I[C_{i}^{(26)} = s], s \in S^{(26)},\]

which, after arg max-ing, is simply the majority vote. That is, we'll choose the film that got the most awards from the non-Oscar awards that year. This seems like a reasonable enough predictor in the situation where we have no other information about how the non-Oscar awards relate to the Oscar awarded in a given year. But we do have such information. It's summarized in Nate Silver's table and captured in this tab-delimited file. We can (hopefully) use the past performance of the non-Oscar awards and the Oscars to predict how the Oscars will turn out this year. Loosely, we can try to pick up on correlations between the non-Oscar awards and the Oscars that might be predictively useful. This is exactly what Silver ends up doing, in an ad hoc fashion.

Using the historical data, Silver computes, for each award \(i\), the fraction of years that award and the Oscars agreed on their winning films. Using our handy-dandy indicator functions again, we can write this value \(\hat{p}_{i}\) as

\[\hat{p}_{i} = \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} I[O^{(j)} = C_{i}^{(j)}]\]

where \(n_{i}\) is the number of years we have data for award ceremony \(i\), since not all of the award ceremonies go back to 1987. In fact, only ten of the fourteen I've included do. With the \(\hat{p}_{i}\) values in hand for all of the sixteen non-Oscar awards, Silver computes his new weighted vote3 as

\[ f(s) = \sum_{i = 1}^{16} \hat{p}_{i}^2 I[C_{i}^{(26)} = s], s \in S^{(26)}.\]

How does this compare to the 'majority vote' predictor above? Now we account for the number of votes given to each of the candidate films, and how those votes match up with Oscars in the past. In other words, if a lot of the non-Oscar awards vote for a film, but those awards also haven't matched the Oscars winner historically, we'll discount their votes.
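
As a small sketch of the two predictors (with a hypothetical data frame standing in for the compiled award results):

```r
# Hypothetical award results: each row is one non-Oscar ceremony's pick this
# year, along with that ceremony's historical agreement rate with the Oscars.
awards <- data.frame(
  winner = c("Argo", "Argo", "Lincoln", "Argo", "Amour"),
  p_hat  = c(0.80, 0.70, 0.60, 0.75, 0.40)
)

weighted_vote <- function(winners, weights) {
  scores <- tapply(weights, winners, sum)   # f(s) = sum_i beta_i * I[C_i = s]
  names(which.max(scores))                  # arg max over the candidate films
}

weighted_vote(awards$winner, rep(1, nrow(awards)))   # majority vote (beta_i = 1)
weighted_vote(awards$winner, awards$p_hat^2)         # weighted vote (beta_i = p_hat_i^2)
```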

So far we've seen two possible choices for the weights \(\beta_{i}\). The first choice for the weights, namely \(\beta_{i} = 1\) for all \(i\), works well in the situation where we have no historical information. The second choice, namely \(\beta_{i} = \hat{p}_{i}^{2}\), seems reasonable enough when all we know is whether or not a non-Oscar award agreed with the Oscars in the past. But why the square? And can we choose a different weighting that gets us a more standard predictor from the statistical learning literature4?

Stay tuned to find out!


  1. A new favorite quote by Alfred Whitehead.

  2. Note that Nate Silver uses 16 awards instead of my 14. I removed the American Cinema Editors Award for Best Comedy or Musical and the Golden Globe for Best Musical or Comedy, since they didn't seem to have much to do with the Best Film, which is usually a Drama.

  3. Well, not quite yet. Silver also doubles the weight for 'insider' awards. These are awards that have a considerable overlap with the Oscars in terms of their members. So, add that doubling to the model to get back Silver's complete predictive model.

  4. My current gut feeling: this feels a lot like logistic regression. We're basically doing a multi-class classification problem. But the fact that the set of films changes every year complicates things a bit.

High School Chemistry — Accuracy and Precision

High school chemistry was the first time I heard the words 'accuracy' and 'precision' in a scientific context. The terms were presented by way of a bullseye example (more on that later), and the Wikipedia article on the topic captures the gist pretty well:

In the fields of science, engineering, industry, and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual (true) value. The precision of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results. Although the two words reproducibility and repeatability can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.

See the Wikipedia article for the standard bullseye figures.

At the time (some nine years ago now), my friends and I would frequently get confused about which was which. Did precision relate to how close to the bullseye you were? Did accuracy relate to how closely grouped the points were? The names themselves didn't really clarify anything. Precise and accurate sound, to the teenage ear, indistinguishable.

While reading through Nate Silver's The Signal and the Noise, I had the (strikingly obvious) realization that these terms could easily be framed in probabilistic language1. I don't think 10th-grade-me would have understood this framing. But that's his fault, not mine.

Here's the setup. Suppose we have some true value, \(\theta\), that we would like to measure. Maybe it's the speed of light in a vacuum, or the volume of water in our volumetric flask. Call our measurement of that value on a single trial \(\mathbf{M}\), and take \(\mathbf{M}\) to be given by

\[\mathbf{M} = \theta + \mathbf{N}.\]

Here, \(\mathbf{N}\) is 'noise.' For example, it might be noise due to measurement apparatus. Basically, \(\mathbf{N}\) will capture all of the things that could go wrong in the process of making the experimental measurement.

Let's choose to model \(\mathbf{N}\) in a simple way. I do this because it makes operationalizing precision and accuracy much easier. We'll assume that

\[\mathbf{N} \sim \text{Normal}(\mathbf{b}, \Sigma)\]

(Hold for all of the comments about why Normal theory is the devil.)

Okay, so we're assuming that the errors in our measurement are normal with mean \(\mathbf{b}\) and covariance \(\Sigma\). And now we're done. The precision of our measurement \(\mathbf{M}\) has to do with how 'small' \(\Sigma\) is. So that I can use the bullseye example, \(\mathbf{M}\) is a vector in \(\mathbb{R}^{2}\), so \(\Sigma\) is a \(2 \times 2\) matrix of covariances

\[(\Sigma)_{ij} = \text{Cov}(N_{i}, N_{j}).\]

This shows us that precision is a relative term: you can always be more or less precise, depending on how small you can make \(\text{Var}(N_{1})\) and \(\text{Var}(N_{2})\).

Accuracy, on the other hand, has to do with \(\mathbf{b}\), the suggestively named bias of our measurement process. If \(\mathbf{N} \sim \text{Normal}(\mathbf{b}, \Sigma)\), then \(\mathbf{M} \sim \text{Normal}(\theta + \mathbf{b}, \Sigma)\). In other words, \(\mathbf{M}\) picks up the bias in our measurement apparatus. No amount of averaging of the measurements will get us the right value. Sorry Law of Large Numbers, we'll always be off from \(\theta\) (on average) by \(\mathbf{b}\). Thus, accuracy is absolute (according to our model): we're either accurate, or we're not.

Now some pictures to clarify this result. We stick with the bullseye example, because it really is nice. As the Wikipedia article states, we can be:

  • precise and accurate, which means that \(\Sigma\) is small and \(\mathbf{b} = \mathbf{0}\)
  • precise and not accurate, which means that \(\Sigma\) is small and \(\mathbf{b} \neq \mathbf{0}\)
  • not precise and accurate, which means that \(\Sigma\) is large and \(\mathbf{b} = \mathbf{0}\)
  • not precise and not accurate, which means that \(\Sigma\) is large and \(\mathbf{b} \neq \mathbf{0}\)

For some reason, the standard textbooks (i.e. the textbooks I learned high school chemistry from) focus on only a few of these cases. All four cases, in the order above, are presented below.
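
Here's a sketch of how one could generate those four panels (a toy version using MASS::mvrnorm; the particular values of \(\mathbf{b}\) and \(\Sigma\) are arbitrary):

```r
library(MASS)   # for mvrnorm

set.seed(2)
theta <- c(0, 0)                                   # the bullseye (true value)
draw  <- function(b, sd) mvrnorm(50, mu = theta + b, Sigma = diag(sd^2, 2))

shots <- list(
  "precise, accurate"         = draw(b = c(0, 0), sd = 0.1),
  "precise, not accurate"     = draw(b = c(1, 1), sd = 0.1),
  "not precise, accurate"     = draw(b = c(0, 0), sd = 0.5),
  "not precise, not accurate" = draw(b = c(1, 1), sd = 0.5)
)

par(mfrow = c(2, 2))
for (case in names(shots)) {
  plot(shots[[case]], xlim = c(-2, 2), ylim = c(-2, 2),
       xlab = "", ylab = "", main = case)
  points(theta[1], theta[2], pch = 3, col = "red")   # mark the true value
}
```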

So there we have it: a simple statistical operationalization of the (often hard to remember) terms precision and accuracy. Another favorite topic of mine from my days as a chemist-in-training is propagation of error computations. Those again become easy to deal with when we reframe the computation in terms of probability. Otherwise, they're a hodgepodge of non-obvious heuristics to look up in a lab manual2.


  1. Fun fact of the day: Galileo was one of the first scientists to propose averaging as a method of getting more correct (in the sense of accurate) measurements from a noisy scientific apparatus. Before him, a lot of scientists cherry-picked the outcomes that seemed most reasonable. I'm sure a lot of scientists still do.

  2. This was more or less how they were treated in my college chemistry lab courses. Which is fair: there's barely enough time to teach the chemistry, let alone instill the prerequisite knowledge of probability and statistics. Especially given the (surprising) disdain for mathematics among many chemistry majors.

[Figures: the four bullseye cases, in the order listed above (precision-accuracy_ap.png, precision-accuracy_nap.png, precision-accuracy_anp.png, precision-accuracy_nanp.png)]

But if we want to eradicate misunderstandings about statistics, I think it would help if we were more careful about how we choose our words.

Larry Wasserman is the patron saint of statistical education. (And a damn good researcher to boot!)

Mathematicians, Onward and Upward!

Can you remember the last time you saw in the newspaper a story that focused on the area under a curve, or the slope of a line? Yeah, neither can I. If you ever stumble upon such a story, then calculus might be your best friend.

To most students, however, statistics will be of far more use than calculus, especially if they select career paths that require them to analyze risk, evaluate trends, confront randomness, deal with uncertainty or extract patterns from data.

- Dave Gammon, from 'Statistically speaking, we don't use calculus much'

This editorial brought a tear to my eye (in both good and bad ways). I'll leave deciphering the reasons why as an exercise for the reader.

Hint: reread the passage quoted above to realize the bad way. And keep reading the rest of the article. It just gets worse.

Gammon is a biology professor at Elon University, a "selective private liberal arts university in North Carolina renowned for engaged and experiential learning." (From Elon's website.)

And researchers like him are why mathematicians / computer scientists / physicists / (real) statisticians will soon colonize the life and social sciences.

Wolfram in Chains

The 100th anniversary of Andrei Markov's presentation on Eugene Onegin to the Imperial Academy of Sciences in St. Petersburg has come and gone. (January 23 was the date.) It was during this presentation that he first proposed the mathematical model we now call, after him, a Markov chain. Markov chains represent a first go at introducing structure into a stochastic process. (The zeroth order go being a sequence of independent random variables.) Basically, if you have a sequence of random variables \(X_{1}, X_{2}, X_{3}, \ldots\), they form a Markov chain (of first order) if \(P(X_{t+1} | X_{t}, X_{t-1}, X_{t-2}, \ldots, X_{1}) = P(X_{t+1} | X_{t})\) for all \(t\). In words, the future \(X_{t+1}\) is independent of the past \(X_{t-1}, X_{t-2}, \ldots\) given the present \(X_{t}\). Or, as a graphical model1, a chain: \(X_{1} \to X_{2} \to X_{3} \to \cdots\), with each node depending only on the one before it.
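
(If you prefer simulation to pictures, here's a minimal sketch in Python with a made-up two-state transition matrix: each new state is drawn using only the current state, which is exactly the Markov property.)

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical transition matrix: P[i, j] = P(X_{t+1} = j | X_t = i).
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    def simulate_chain(P, x0, T):
        """Simulate T steps of a first-order Markov chain started in state x0."""
        x = [x0]
        for _ in range(T):
            # The next state depends only on the current state.
            x.append(rng.choice(len(P), p=P[x[-1]]))
        return np.array(x)

    print(simulate_chain(P, x0=0, T=20))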

The other site I saw mention this was the Wolfram Blog. Which made me wonder:

If Wolfram had invented Markov chains, would he have patented them?

What do I mean? Stephen Wolfram patented one of the cellular automata he invented. And then sued a researcher who used that cellular automaton in an academic paper.

Which raises the question: how much money could Markov have made off of his chains?

They show up in

  • speech recognition
  • weather prediction
  • gene annotation
  • Google's PageRank algorithm
  • the generation of fake papers

amongst many other places2.

He'd certainly be rich.

Thankfully, he had better things to worry about:

By subscribing to them I can only weaken that which for me is particularly dear: the rigor of judgements I permit.

For example, it is very important to me to observe that Poisson did not prove his theorem; moreover, I cannot consider a statement a theorem unless it is established precisely.

-- from a letter by Andrei Markov (probabilist) to Alexander Chuprov (mathematical statistician), quoted in The Life and Work of A. A. Markov


  1. Now you can see why these are called Markov chains.

  2. Including, of interest to me, my own research.

The Problem with Statistics

I have had several discussions this week about statistics that have clarified my thinking about the field. The first started with a student I am working with this semester in a directed reading program. We plan to do an in-depth analysis of Nate Silver's prediction model for the 2008 and 2012 presidential elections1.

During our discussion of statistics, I raised the point that I consider statisticians to fall into two classes. Those two classes, unsurprisingly, are those statisticians whom I like, and those whom I don't. At the time, I couldn't articulate the difference between the two types beyond that. I had a feeling in my gut. But a gut feeling does not a definition make.

I've since discussed this distinction with a few of my colleagues2, and in the process of explaining myself to them, we've nailed down a better distinction.

The distinction is so simple, really, that I'm surprised it didn't immediately come to mind. But it's the fact that I had to make the distinction in the first place (and that it wasn't automatically made for me) that points out the 'problem with statistics.'

Here it is: in the common parlance, we call anyone who does statistics3 a statistician.

And we don't do that with many other professions.

Here are a few examples of professions where we do make that distinction:

  • Software Engineer != Computer Scientist
  • Mechanical Engineer != Physicist
  • Accountant != Mathematician

And yet anyone who looks at data in a quantitative way (especially if that is all they do) gets called a statistician.

How strange.

Fortunately (?) for the field of statistics, another name for the LHS sort of 'statistician' is coming into common use: data scientist. A data scientist is to a statistician what a software engineer is to a computer scientist.

Of course, none of these distinctions is clear cut. And we need both ends of every spectrum. But I don't think it hurts to distinguish between the nose-to-the-grindstone sort of professions and the more theoretical, ivory tower sorts.

Finally, in the vein of Jeff Foxworthy, a list of distinctions I would make between a data scientist and a statistician:

If you don't know that a random variable is a measurable function from the sample space to the outcome space, you might be a data scientist.

If you don't know how to program in R, you might be a statistician.

If you think that a confidence interval for a parameter makes a probability statement about the parameter, you might be a data scientist.

If you only care about the asymptotic distribution of an estimator, you might be a statistician.

If you think that rejecting the null hypothesis means that you accept the alternative hypothesis, you might be a data scientist.

If you don't know how to compute a least-squares estimator (in practice), you might be a statistician.

If you don't check that the residuals of a least-squares fit are white noise, you might be a data scientist.

If you don't know how to use Hadoop, you might be a statistician.

Okay, so my bit needs some work. But hopefully I've offended both sides of the spectrum enough that I won't be called biased!
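
In the spirit of the least-squares jabs above, here's a minimal sketch in Python (NumPy, on simulated data) of computing a least-squares estimator 'in practice' and doing a crude check that the residuals look like white noise.

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulated data: y = 2 + 3x + Gaussian noise.
    n = 200
    x = rng.uniform(0, 1, size=n)
    y = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=n)

    # Least-squares "in practice": design matrix plus a numerically sane solver.
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated coefficients:", beta_hat)   # ~ [2, 3]

    # Crude white-noise check on the residuals: mean near zero, negligible
    # lag-1 autocorrelation.
    r = y - X @ beta_hat
    print("residual mean:", r.mean())
    print("lag-1 autocorrelation:", np.corrcoef(r[:-1], r[1:])[0, 1])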


  1. Silver's model is something I've wanted to learn about for a while. I've read The Signal and the Noise, but Silver doesn't go into all that much detail there. Rightly so, considering the book (a) is for popular consumption and (b) covers many more topics besides election prediction.

  2. A fancier sounding word for 'friends.'

  3. Whether that be simple things like t-tests, linear regression, and ANOVA or advanced things like empirical process theory, non-parametric regression, and minimax bounds.

Reading a paper is like peeling an onion...

I've been reading a lot of papers lately. I should say I've been reading a lot of papers carefully lately. I usually read a lot of papers. Because what else is a guy to do on a nice winter evening?

So what's different this time? I'm preparing for my preliminary oral exam to progress to candidacy in the PhD program here at Maryland. For the 'exam' (as the name 'oral' indicates, it's really more of a presentation), we ('we' being people in the Applied Mathematics, Statistics, and Scientific Computation program) do a literature review of a particular topic, choosing one or two primary references and a smattering (approximately eight or so) secondary references. The goal is to 'demonstrate that [I have] a deep enough understanding to carry out research in a proposed area.' No big deal.

The area of research I've chosen to focus on is called computational mechanics. That deserves a post of its own someday soon. And a Wikipedia page. Because this is not what I mean1. (Also, \(\epsilon\)-machines2.) Rather, I mean the study of processes from a 'computation theoretic' framework. That is, study a given phenomenon as something doing a computation. An obvious example of such a thing is a neuron. But other natural systems 'compute' in much the same way.

And this framework turns out to be useful statistically, in terms of optimally predicting some stochastic process \(X(t)\).

But I'm getting carried away. The thing I wanted to comment on is the difference between 'deep reading' and just 'reading.' Everyone who has pursued a degree that involves 'working problems' (i.e. solving some sort of mathematically oriented problem set) knows that you can't read a math (or physics, or chemistry, or...) textbook the same way you read a novel. And yet I forget that sometimes. Just because I think I know something at a given instant of time doesn't mean that (a) I really understand the finer points or (b) the material has any chance of entering my long term memory.

I might post a page or two of my notes sometime, just for dramatic effect. I always loved the photos of science notebooks (like this) in SEED Magazine3.

I have a resolution in mind for the rest of the year: any paper I want to read and internalize, I won't allow myself to read without a pen in hand.


  1. Computational mechanics has no textbook, very few pedagogical references, and no Wikipedia entry. What it does have is a very badass reputation as the optimal nonlinear predictive filter for any stochastic process. So, yeah, that's kind of cool. But at the same time, I wish that one of the fundamental building blocks didn't have the name \(\epsilon\)-machine.

  2. Try Googling that, and you'll learn about machine epsilon. A super useful concept, but not at all related to computational mechanics. Fortunately the second entry is by Cosma Shalizi, though it's a little outdated. (I was eleven years old when the entry was last updated...)

  3. Fun fact: the first issue of SEED I ever saw was in high school, when Kevin Orlando (my high school chemistry teacher and friend) had a copy lying around. The main topic for that issue: homosexuality in the animal kingdom.

Cliodynamics

I've learned a new word in the past week that has shown up in three (not quite) independent contexts. From Wikipedia:

Cliodynamics (etymologically from Clio, one of the nine muses (that of history), and dynamics, the study of temporally varying processes) is a new multidisciplinary area of research focused at mathematical modeling of historical dynamics.

I first encountered the word through a link to this journal by a friend. I then read about it in another link I followed through a friend (who is directly related to the original friend, thus the non-independence). Finally, I read about it in the 'Long Data' post by Sam Arbesman I mentioned last time.

What is cliodynamics? A field1 focusing on 'theoretical historical social science.'

Which sounds fantastic. And interesting. But taking a look at a sample of articles from the journal I linked to, I commented to my friend:

Some of the authors seem to be academics, and others look like amateurs with Excel spreadsheets (and comcast.net email addresses...).

Not that Excel spreadsheets necessarily preclude good science. But they act as a good filter2 indicating the people doing the science might not be using the most sophisticated methodology. As Aaron Clauset comments:

We implemented data methods from scratch because we don't trust the standard software packages and neither should you.

Or as I would modify it,

We implemented data methods from scratch because we want to understand the standard algorithms, and so should you.

Since I should be doing real3 research (possibly more on that later), Cliodynamics will have to go on the back burner for now.


  1. Not 100% sure about its reputation as a field. Though it did get a mention in an opinion piece in Nature... (That piece has been added to my Instapaper roll.)

  2. In the same way that someone using LaTeX doesn't necessarily indicate they're more mathematically sophisticated. But it's rare (though not impossible!) to come across someone who is mathematically sophisticated and doesn't use LaTeX.

  3. That is, research related to my PhD thesis.

Long Data

As someone pointed out (commenting on one of my favorite algorithms), "hardly a day goes by without someone coming up to [me] and handing [me] a long discrete-valued time series and asking [me] to find a hidden Markov model for it."

Which brings me to a recent blog post by Samuel Arbesman (an academic uncle, since my advisor shares an advisor with him) about 'long data.' If you follow any of the same news sources as me, you've heard a lot of hype about 'big data,' and how it will revolutionize medicine, unlock human potential, and, according to some, make science obsolete1. Fortunately for people like me2, the big data world is our playground.

Arbesman raises the point, however, that snapshots of big data are not enough. Long amounts of data are also necessary. That's something I face on a daily basis in my research: there's never enough data in time.

Why do we care about this? Well, if we assume the process we're studying is stationary3, then sometimes4 looking at a long enough time series from the process tells us all we could ever want to know. For example, we wouldn't gain anything more from looking at various other runs of the process: everything we need is in that one, sufficiently long time series. In math-speak,

\[ E[X(t)] = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} X(s) \, ds,\]

which is a fancy way of saying that ensemble averages and time averages are equivalent.
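
Here's a minimal sketch in Python of that equivalence, using a made-up stationary AR(1) process: the time average from one long run lands right on the ensemble mean.

    import numpy as np

    rng = np.random.default_rng(3)

    # A made-up stationary AR(1) process: X_{t+1} = c + phi * X_t + noise.
    # Its ensemble mean is c / (1 - phi) = 5.
    c, phi, sigma = 1.0, 0.8, 1.0
    T = 200_000

    x = np.empty(T)
    x[0] = c / (1 - phi)              # start at the stationary mean
    for t in range(T - 1):
        x[t + 1] = c + phi * x[t] + sigma * rng.normal()

    print("time average (one long run):", x.mean())   # ~ 5
    print("ensemble average (by hand): ", c / (1 - phi))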

For non-stationary processes, I'm not sure how useful 'long data' would be. But we don't really know how to handle (interesting) non-stationary processes, so that's not too surprising.


  1. Personally, I don't buy this. No matter how much data we have, we need models to make sense of it. Otherwise, the data is just a string of numbers.

  2. With 'people like me' being people interested in statistical inference in complex systems.

  3. Which, if you've never had a course in stochastic processes, probably doesn't quite mean what you think it does. Think of a stationary process as not changing too much over time.

  4. In particular, when the process is ergodic.

More on Nate Silver

I originally posted this on Facebook. But why not here as well?

tl;dr: I complain about the use of 'Bayesian' in popular culture, and in the process rant (even more) about Nate Silver.

More people (a psychology professor and a computer science professor at NYU, no less!) get the definition of 'Bayesian' wrong, this time in the New Yorker.

Marcus and Davis start off strong1. And I understand that they're writing for New Yorker readers (presumably more mathematically sophisticated than your average American, but not amazingly so). But 'Bayesian' != 'Someone who uses Bayes's theorem'. Bayes's Theorem isn't even a theorem! It's a corollary of basic results about conditional probabilities.

A Bayesian is, in the loosest sense, someone who treats the parameters of a statistical model as if they are themselves random variables. A frequentist treats them as fixed (or, if you'd like, a trivial random variable where all of the probability mass lies on a single value). In practice, that's it.
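
As a toy illustration of the distinction (a sketch, not anyone's official definition), take coin-flip data in Python: the frequentist reports an estimate of the fixed parameter, while the Bayesian puts a prior on the parameter and reports the posterior.

    from scipy.stats import beta

    # Toy data: 100 coin flips, 62 heads.
    n, heads = 100, 62

    # Frequentist: the heads probability p is a fixed unknown number; report the
    # maximum-likelihood estimate.
    print("MLE:", heads / n)

    # Bayesian: treat p itself as a random variable. With a uniform Beta(1, 1)
    # prior, the posterior is Beta(1 + heads, 1 + n - heads).
    a, b = 1 + heads, 1 + (n - heads)
    print("posterior mean:", a / (a + b))

    # A 95% credible interval: a probability statement about p itself, which only
    # makes sense once p is treated as random.
    print("95% credible interval:", beta.ppf([0.025, 0.975], a, b))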

Larry Wasserman (a card-carrying statistician, unlike Silver, Marcus, Davis, or, well, myself) explains.

The Signal and the Noise is a fantastic book. If anything, it shows how far you can get using statistics without a firm understanding of its foundations (or lack thereof).


  1. And finish strong, for that matter. I should really give them more credit for their arguments. Andrew Gelman, another card-carrying statistician (and a Bayesian to boot!), seems to mostly agree with Marcus and Davis. Which makes me wonder what I'm missing in the definition of 'Bayesian.'

Flipping Coins

Here's an interview question.

Suppose you have a biased coin. Assume that the coin comes up tails 1/3 of the time, and heads 2/3 of the time. Question: how can you use this biased coin to simulate a fair coin?

I'll walk you through my thought process, and then give the 'right' answer (I don't know if there is a best answer).

First, let's formalize the problem (because for some reason I have trouble wrapping my head around some problems unless I first over-formalize them). Let \(X\) be a Bernoulli random variable taking the values 0 (for tails) and 1 (for heads). Since we have a biased coin, \(P(X = 0) = 1/3\) and \(P(X = 1) = 1 - P(X = 0) = 2/3.\) Our goal is to do something with \(X\) that gives us a random variable \(U\) such that \(P(U = 0) = P(U = 1) = 1/2.\)

My first thought was to apply the cumulative distribution function to \(X\). This method works for continuous random variables. Let \(X\) be a continuous random variable with cumulative distribution function \(F_{X}(x)\). Then the new variable \(U = F_{X}(X)\) (that is, the random variable you get by applying the CDF of \(X\) to itself) is uniform on \([0, 1]\). (I'll leave the proof of this to the reader. Basically, consider \(P(U \leq u)\), substitute for \(U\), and then notice some nice things about \(F_{X}(x)\) in the continuous case.) But as I said, this applies to continuous random variables.
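
The continuous-case claim is easy to check numerically. A minimal sketch in Python (NumPy/SciPy), using an arbitrary Exponential(1) for \(X\):

    import numpy as np
    from scipy.stats import expon, kstest

    rng = np.random.default_rng(5)

    # If X is continuous with CDF F, then U = F(X) is uniform on [0, 1].
    # Here X is (arbitrarily) Exponential with rate 1.
    x = rng.exponential(scale=1.0, size=100_000)
    u = expon.cdf(x)

    print(u.mean(), u.var())      # ~ 0.5 and ~ 1/12, as for Uniform[0, 1]
    print(kstest(u, "uniform"))   # no evidence against uniformity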

The next most obvious approach is to flip the coin more than once. If we assume independence between flips, then for two flips \(P(X_{1} = x_{1}, X_{2} = x_{2}) = P(X_{1} = x_{1}) P(X_{2} = x_{2})\). We can spell out the associated joint probability mass function, but you'll basically see that the sequences and their associated probability mass are

  • 00 -- 1/9
  • 01 -- 2/9
  • 10 -- 2/9
  • 11 -- 4/9

(Important note: the sum of all of these is 1.)

At this point, an 'obvious' (though it didn't immediately jump out at me) solution to our problem presents itself. Notice that the probabilities of getting a 01 and a 10 are the same. The leap we then have to make is to throw out some tosses. If we just ignore the 00 and 11 tosses (pretend they don't happen, if you'd like), then we can call 01 heads and 10 tails, and we have ourselves a fair coin.

Of course, we have to flip our biased coin more frequently to generate a sequence of 0s and 1s we could generate with an unbiased coin: we have to flip twice to get an outcome, and then we throw out 5/9 of those outcomes! Which leaves me wondering if a more efficient method exists.

One nice property of this method (beyond the fact that it solves the interview question and hopefully gets us the job1) is that it will work for any biased coin. In fact, this method is important enough that it has been named the von Neumann extractor: given any uncorrelated (no need for independence) sequence of 0s and 1s generated by a (stationary) stochastic process, this method will produce a fair coin flip.
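
Here's a minimal sketch of the extractor in Python, with the 1/3 vs. 2/3 bias baked in to match the example: flip in pairs, return the first bit of any mixed pair, and discard the doubles.

    import random

    random.seed(4)

    def biased_flip(p_heads=2/3):
        """One flip of the biased coin: 1 (heads) with probability 2/3."""
        return 1 if random.random() < p_heads else 0

    def fair_flip():
        """von Neumann's trick: flip in pairs and keep only the mixed pairs."""
        while True:
            a, b = biased_flip(), biased_flip()
            if a != b:
                return a    # 01 and 10 are equally likely, so this bit is fair

    flips = [fair_flip() for _ in range(100_000)]
    print(sum(flips) / len(flips))   # ~ 0.5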

Good on von Neumann.

Aside: I think I'll start collecting these sorts of questions on this blog. Presumably someone else has had this idea before me. (A quick google search reveals yes, they have.) But it can't hurt for another person to do it.


  1. It's worth noting that this question did not come from an interview I did. I've only done one interview in my life. In it, I expounded on the virtues of multidimensional Newton's method, and learned that I should know how to talk intelligently about pointers and binary trees. (This was during my last semester of undergrad, and my only computer science course at that point had used Java, which was conspicuously pointer free!)