The End of Bayes?

I'll start with an excerpt by Cosma Shalizi:

This raises the possibility that Bayesian inference will become computationally infeasible again in the near future, not because our computers have regressed but because the size and complexity of interesting data sets will have rendered Monte Carlo infeasible. Bayesian data analysis would then have been a transient historical episode, belonging to the period when a desktop machine could hold a typical data set in memory and thrash through it a million times in a weekend.

— Cosma Shalizi, from here

Actually, you should leave here and read his entire post. Because, as is typical of Cosma, he offers a lot of insight into the history and practice of statistics.

For those in the know about Bayesian statistics¹, the ultimate object of study is the posterior distribution of the parameter of interest, conditioned on the data. That is, we go into some experiment with our own prejudices about what value the parameter should take (by way of a prior distribution on the parameter), do the experiment and collect some data (which gives us our likelihood), and then update our prejudice about the parameter by combining the prior and likelihood intelligently (which gives us our posterior distribution). This is not magic. It is a basic application of basic facts from probability theory. The (almost implicit) change of perspective that makes this sort of analysis Bayesian, rather than frequentist, is that we're treating the parameter as a (non-trivial) random variable: we pretend it's random, as a way to encode our subjective belief about the value it takes on. In math (to scare away the weak spirited),

\[ f(\theta | \mathbf{X}) = \frac{f(\mathbf{X} | \theta) f(\theta)}{\int f(\mathbf{X} | \theta') f(\theta') \, d \theta'}.\]

As Cosma points out in his post, the hard part of Bayesian statistics is deriving the posterior. Since this is also the thing we're most interested in, that is problematic. The strange thing about dealing in Bayesian statistics is that we rarely ever know the posterior exactly. Instead, we use (frequentist) methods to generate a large enough sample from something close enough to the actual posterior distribution that we can pretend we have the posterior distribution at our disposal. (Isn't Bayesian statistics weird²?) This is why Bayesians talk a lot about MCMC (Markov Chain Monte Carlo), Gibbs Sampling, and the like: they need a way to get at the posterior distribution, and pencil + paper will not cut it.

It's one thing to read about how these methods can fail in theory, but another to see it in practice. In particular, in the Research Interaction Team I attend on penalized regression, we recently read this paper on a Bayesian version of the LASSO³. In the paper, the authors never apply their method to the case where \(p\) (the number of dimensions) is much greater than \(n\) (the number of samples). Which is odd, since the frequentist LASSO really shines in the \(p \gg n\) regime.

But then I thought about the implementation of the Bayesian LASSO, and the authors' 'neglect' became clear. To get a decent idea of the posterior distribution of the parameters of the Bayesian LASSO, the authors suggested sampling 100,000 times from the posterior distribution. But for each sample, they have to solve, in essence, a \(p\) dimensional linear regression problem. For \(p\) very large, this is a costly thing to do, even with the best of software. And they have to do this ten thousand times. Compare this to using coordinate descent for solving the frequentist LASSO, which is very fast, even for large \(p\). And we only have to do this once.

In other words, we've hit up against the wall that Cosma alluded to in his post. The Bayesian LASSO is a clever idea, and its creators show in their paper that it outperforms the frequentist LASSO in the \(p \approx n\) case⁴. But this case is becoming less and less the norm as we move to higher and higher dimensional problems. And MCMC isn't going to cut it.

If you're playing along at home, Bayesian statistics is this, and not this or this.↩
To be fair, in frequentist statistics, we pretend that our data are generated from an infinite population, and that our claims hold only over infinite repetitions of some experiment. I suppose it's all about what sort of weird you feel most comfortable with.↩
Somehow, this wasn't the first link to appear when I typed 'LASSO' into Google. Sometimes I have to remind myself that the real world exists.↩
Though they show performance based on simulations over ensembles of examples where \(\boldsymbol{\beta}\) is fixed and randomness is introduced via the experiment. That is, they show its effectiveness in the frequentist case. Again, statistics is a weird discipline...↩