Why Interpolation Isn't Enough

I wrote this as a response to this post, titled "Is predictive modeling different from interpolation? Do we really need stats?"

My answer to both questions is a resounding yes.


In one word: generalization.

In many more words:

When you say 'interpolation' of the data, I assume you don't mean threading a function through all of the data points, but rather doing some sort of least-squares fitting over intervals (e.g., using splines, à la Chapter 5 of Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning).
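
To make that distinction concrete, here is a minimal Python sketch (my own illustration; the SciPy calls, noise level, and test function are not from the original exchange). An exact interpolant threads a curve through every observation, while a smoothing spline does penalized least-squares fitting and is allowed to miss the points:

    import numpy as np
    from scipy.interpolate import CubicSpline, UnivariateSpline

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 25)
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)      # signal + noise

    exact = CubicSpline(x, y)                                # threads through every (x_i, y_i)
    smooth = UnivariateSpline(x, y, s=x.size * 0.3**2)       # penalized least-squares spline

    print(np.max(np.abs(exact(x) - y)))    # ~0: reproduces the data exactly
    print(np.max(np.abs(smooth(x) - y)))   # > 0: trades fidelity to the data for smoothness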

The goal of statistical modeling is to distinguish between what's there by coincidence (noise) and what's there because of something inherent to the object of study (signal). Usually we care about the signal, not the noise. Without statistics, you're modeling noise + signal, and the noise doesn't generalize. If we didn't think there was any noise (coincidence) in the data, then you'd be correct: we might as well throw out the probabilistic model and use approximation theory, perhaps with an approach like Nick Trefethen's chebfun:

http://www.maths.ox.ac.uk/chebfun/
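
Here's a rough Python stand-in for that noise-free setting, using NumPy's Chebyshev tools rather than chebfun itself (the target function and degrees are just my choices for illustration). With no noise to chase, the interpolant converges rapidly toward the function:

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def f(x):
        return np.exp(x) * np.sin(5 * x)       # smooth, noise-free target

    for n in (5, 10, 20, 40):
        xk = C.chebpts2(n)                     # Chebyshev points on [-1, 1]
        coefs = C.chebfit(xk, f(xk), n - 1)    # degree n-1 interpolant through all points
        xs = np.linspace(-1, 1, 2001)
        err = np.max(np.abs(C.chebval(xs, coefs) - f(xs)))
        print(f"n = {n:3d}   max error = {err:.1e}")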

But the noise doesn't generalize, so the interpolated function will get us farther from the 'truth,' on average, than a regression-based method would.
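
Here's a hedged simulation of that claim (again my own illustration, not from the original discussion; the noise level and polynomial degree are arbitrary): fit the same noisy sample with an exact interpolant and with a low-degree least-squares polynomial, then measure each against the noise-free signal at new inputs:

    import numpy as np
    from scipy.interpolate import CubicSpline

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 2 * np.pi, 30)
    y_train = np.sin(x_train) + rng.normal(scale=0.3, size=x_train.size)

    interp = CubicSpline(x_train, y_train)         # threads through signal + noise
    coefs = np.polyfit(x_train, y_train, deg=5)    # least-squares regression

    x_test = np.linspace(0, 2 * np.pi, 2000)       # fresh inputs; compare to the noise-free signal

    def rmse(y_hat):
        return np.sqrt(np.mean((y_hat - np.sin(x_test)) ** 2))

    print("interpolation RMSE:", rmse(interp(x_test)))
    print("regression RMSE:   ", rmse(np.polyval(coefs, x_test)))
    # On most draws the interpolant's error is noticeably larger: the noise it
    # reproduces at the training points doesn't carry over to new inputs.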

How to build 'model-free' regression methods is an open question. See, for example, Section 2.2 of this paper:

http://www.rmm-journal.de/downloads/Article_Wasserman.pdf

But even there, we don't assume that the data are exactly right; the observations are still treated as noisy.

If the world weren't noisy, we could just use interpolation and throw out the stochastic considerations inherent in statistical models. But large parts of the world are noisy.