Long Data

As someone pointed out (commenting on one of my favorite algorithms), "hardly a day goes by without someone coming up to [me] and handing [me] a long discrete-valued time series and asking [me] to find a hidden Markov model for it."

Which brings me to a recent blog post by Samuel Arbesman (an academic uncle, since my advisor shares an advisor with him) about 'long data.' If you follow any of the same news sources as me, you've heard a lot of hype about 'big data,' and how it will revolutionize medicine, unlock human potential, and, according to some, make science obsolete¹. Fortunately for people like me², the big data world is our playground.

Arbesman raises the point, however, that snapshots of big data are not enough. We also need data that stretches over long spans of time. That's something I face on a daily basis in my research: there's never enough data along the time axis.

Why do we care about this? Well, if we assume the process we're studying is stationary³, then sometimes⁴ looking at a long enough time series from the process tells us all we could ever want to know. For example, we wouldn't gain anything more from looking at various other runs of the process: everything we need is in that one, sufficiently long time series. In math-speak,

\[ E[X(t)] = \lim_{T \to \infty} \frac{1}{T} \int_{0}^{T} X(s) \, ds,\]

which is a fancy way of saying that ensemble averages and time averages are equivalent.
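To make that concrete, here is a rough sketch in Python of the same idea in discrete time, where the integral becomes a sum. The two-state Markov chain, its transition probabilities (`p01`, `p10`), and the helper `simulate_chain` are all my own illustrative choices, not anything from Arbesman's post. The chain is stationary and ergodic, so the average over one long run and the average across many independent runs should both land near the stationary mean.

```python
import numpy as np


def simulate_chain(n_steps, p01, p10, rng):
    """Simulate a two-state (0/1) Markov chain started from its stationary distribution.

    p01 is the probability of jumping 0 -> 1, p10 the probability of jumping 1 -> 0.
    (Hypothetical helper, for illustration only.)
    """
    pi1 = p01 / (p01 + p10)  # stationary probability of being in state 1
    x = np.empty(n_steps, dtype=int)
    x[0] = rng.random() < pi1  # draw the initial state from the stationary distribution
    u = rng.random(n_steps)    # one uniform draw per time step
    for t in range(1, n_steps):
        if x[t - 1] == 0:
            x[t] = 1 if u[t] < p01 else 0
        else:
            x[t] = 0 if u[t] < p10 else 1
    return x


rng = np.random.default_rng(0)
p01, p10 = 0.1, 0.3                  # assumed transition probabilities
stationary_mean = p01 / (p01 + p10)  # E[X(t)], the same for every t

# Time average: one long run of the process.
long_run = simulate_chain(200_000, p01, p10, rng)
time_average = long_run.mean()

# Ensemble average: many independent short runs, each observed at one fixed time.
ensemble = np.array([simulate_chain(50, p01, p10, rng)[-1] for _ in range(5_000)])
ensemble_average = ensemble.mean()

print(f"stationary mean:  {stationary_mean:.4f}")
print(f"time average:     {time_average:.4f}")
print(f"ensemble average: {ensemble_average:.4f}")
```

The point of the comparison is that the single long run never needed the other 5,000 runs: for a process like this one, the 'long data' alone is enough to recover the mean.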

For non-stationary processes, I'm not sure how useful 'long data' would be. But we don't really know how to handle (interesting) non-stationary processes, so that's not too surprising.

¹ Personally, I don't buy this. No matter how much data we have, we need models to make sense of it. Otherwise, the data is just a string of numbers.
² With 'people like me' being people interested in statistical inference in complex systems.
³ Which, if you've never had a course in stochastic processes, probably doesn't quite mean what you think it does. Think of a stationary process as one whose statistical properties don't change over time.
⁴ In particular, when the process is ergodic.