Mind the Gap

Larry Wasserman recently had a post about the gap between the development of a statistical method and the justification of that method. He gives timelines for various methods, from the standard (like Maximum Likelihood Estimation) to the now-becoming-standard (like Causal Inference), demonstrating how much time had to pass between when inventors proposed a method and when researchers (sometimes the same as the inventors!) proved results about why that method 'makes sense.'

I suppose, for the non-mathematically inclined reader, that I have to explain a little what I mean by 'makes sense.' Any method can be proposed to solve a problem (statistical or otherwise). But for statistical problems, we (sometimes) have the tools to answer whether or not we'll get the right answer, most of the time, with a given method.

As an illustrative example, consider the goal of determining the exponent of a power law model for the distribution of some empirical phenomenon. This is a favorite topic of a lot of complex systems scientists, especially physicists. For those with no training in statistics (beyond what they get in their undergraduate lab courses), a logical approach might be computing the empirical complementary distribution function of the data (the fraction of observations at or above each value), taking the logarithm of that function, and fitting a line through it against the logarithm of the data¹. If the data are power law distributed, then the true² complementary cumulative distribution function of the data will be a power law, and thus its logarithm will be linear in the logarithm of the data. Taking the slope of the line of best fit would give an affine transformation of the exponent, and our job is done.

Except that we don't have the true cumulative distribution function. Instead, we have the empirical distribution function, which is itself a stochastic process with some underlying distribution. In the limit of infinite data, the empirical distribution will converge to the true distribution function almost surely. But in the meantime, we don't have many guarantees about it, and we certainly don't know that it will give us a good estimate of the exponent of the power law.

Instead, we should use Maximum Likelihood Estimation, as proposed by Ronald Fisher (and perhaps first used by Carl Friedrich Gauss in some computations about celestial orbits). This method is standard statistical fare³, and its application to power laws is well documented. And unlike the line-of-best-fit-through-the-empirical-distribution-function method, it has provably nice properties. And yet people still continue to use the line-of-best-fit method. Why? Because it makes intuitive sense. But it doesn't 'make sense,' statistically.
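To make the comparison concrete, here is a minimal sketch of the two approaches on simulated data. Everything in it is an assumption for illustration: the function names are hypothetical, the data are drawn from a continuous power law with density proportional to \(x^{-\alpha}\) above a known lower cutoff `x_min`, and the line is fit to the log of the empirical complementary distribution function (the form that is actually a power law). The maximum likelihood estimate used here is the standard closed form for the continuous case, \(\hat{\alpha} = 1 + n / \sum_{i} \ln(x_{i}/x_{\min})\).

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n values from a continuous power law with exponent alpha, x >= x_min."""
    # Inverse-transform sampling: the complementary CDF is (x / x_min)^(-(alpha - 1)).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def line_fit_exponent(data):
    """Estimate alpha from the least-squares slope of log CCDF against log x."""
    xs = sorted(data)
    n = len(xs)
    log_x = [math.log(x) for x in xs]
    # Empirical complementary CDF at each sorted point: fraction of data >= x.
    log_ccdf = [math.log((n - i) / n) for i in range(n)]
    mean_x = sum(log_x) / n
    mean_y = sum(log_ccdf) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_ccdf))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    slope = sxy / sxx
    # The CCDF has slope -(alpha - 1), so alpha is an affine function of the slope.
    return 1.0 - slope


def mle_exponent(data, x_min):
    """Maximum likelihood estimate of alpha, assuming x_min is known."""
    n = len(data)
    alpha_hat = 1.0 + n / sum(math.log(x / x_min) for x in data)
    # Asymptotic standard error of the estimate.
    std_err = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, std_err


def compare(n_trials=200, n=500, alpha=2.5, x_min=1.0, seed=0):
    """Root-mean-square error of each estimator over repeated simulated samples."""
    rng = random.Random(seed)
    sq_err_line = 0.0
    sq_err_mle = 0.0
    for _ in range(n_trials):
        data = sample_power_law(n, alpha, x_min, rng)
        sq_err_line += (line_fit_exponent(data) - alpha) ** 2
        alpha_hat, _std_err = mle_exponent(data, x_min)
        sq_err_mle += (alpha_hat - alpha) ** 2
    return math.sqrt(sq_err_line / n_trials), math.sqrt(sq_err_mle / n_trials)


rmse_line, rmse_mle = compare()
print(f"RMSE, line of best fit: {rmse_line:.3f}")
print(f"RMSE, maximum likelihood: {rmse_mle:.3f}")
```

In a simulation like this one would expect the maximum likelihood estimate to show the smaller error, and it also comes with a standard error for free. The sketch sidesteps the harder parts of a real analysis, such as choosing `x_min` and checking whether a power law is a reasonable model at all.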

All of that to explain the (research) nightmares that keep me up at night: using (or worse, proposing!) a method that doesn't 'make sense.' Which is why Larry Wasserman's post is so consoling. Perhaps I don't immediately know why a method I would like to use makes sense. But hopefully in the future, someone far smarter than me will derive the justification for my method.

To quote Cosma Shalizi, from Section 11.4.3 of his thesis, where he proposes using computational mechanics for continuous-valued, discrete-time stochastic processes,

It becomes a problem of functional analysis, so the mathematical foundations, if we aim at keeping to even the present standard of rigor, will get much more complicated. Still, we might invoke the physicist's license to ignore foundations, on the grounds that if it works, the mathematicians will find a way of making it right.

For those interested, he has since made this leap. See his and his graduate student Georg Goerg's work on LICORS.

Wasserman's post was prescient. Rob Tibshirani, one of the co-inventors of the LASSO (an \(\ell_{1}\)-regularized least squares method), recently developed a significance test for the LASSO (one of the methods listed in Wasserman's 'The Gap' post). This provides even more (statistically justified!) tools for using the LASSO.

Which means that the method now 'makes sense' even more.

¹ Admittedly, this is the favorite approach of a lot of people, not just physicists. But for a group supposedly interested in nonlinear science, the use of a linear method seems suspect.
² Okay, 'true' here may be a bit too strong of a word. Let's say 'a good model of the data generating process.'
³ I learned about it in my undergraduate mathematical statistics course.