DNA Sequences as a Stochastic Process

Occasionally I'll write a response to a question on Quora, partly as a means to procrastinate, partly for educational purposes, and partly to 'show off.'

Here was my answer to the question 'Can DNA sequences be treated as time series?'

A DNA sequence (ACGTACT..., for example) is a collection of bases that have a definite spatial order (the A really does precede the C, which really does precede the G, etc., in the DNA molecule, at least if we read the molecule linearly, from 5' to 3', for instance). Since the indexing matters, we have found ourselves in the realm of stochastic processes. A stochastic process (at least from one perspective) is an indexed collection of random variables. The simplest stochastic process that most scientists are familiar with is a random sample, the so-called IID (independent and identically distributed) stochastic process, where each random variable in the sequence is independent of every other, and all of the random variables have the same distribution. This is the bread and butter of introductory statistics courses (mostly because assumptions of IIDness make problems tractable). But nature need not behave according to our theories from STAT100.
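
To make the IID assumption concrete, here is a minimal sketch that fits an IID model to a sequence by estimating base frequencies and then scores the sequence under that model. The function name and the toy sequence are made up for illustration. The point to notice is that the score only depends on how many times each base occurs, not on the order, so shuffling the sequence leaves it unchanged.

```python
import math
from collections import Counter

def iid_log_likelihood(seq):
    """Log-likelihood of `seq` under an IID model fit to `seq` itself.

    Hypothetical helper for illustration: base probabilities are
    estimated as the observed frequencies in the sequence.
    """
    counts = Counter(seq)
    n = len(seq)
    freqs = {base: count / n for base, count in counts.items()}
    # Under independence the log-likelihood is a sum over positions,
    # which collapses to count * log(frequency) for each base.
    return sum(count * math.log(freqs[base]) for base, count in counts.items())

toy = "ACGTACGGTACGA"  # made-up sequence, not real genomic data
ll = iid_log_likelihood(toy)
```

Because order is ignored, an IID model cannot capture the kind of base-to-base dependence discussed next.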

A general stochastic process allows for arbitrary dependencies between all of the random variables. For instance, it might be the case that an A is more likely to follow a G (I don't know the biology, but I have seen evidence that this sort of thing does happen in real genomes), or a T to follow a C. If we incorporate this sort of information into our model of the DNA sequence, we have something called a first-order Markov chain (the probability of observing a base X only depends on the previous base observed, and is independent of all the bases before that). We can extend this to general nth-order Markov chains, where the probability of observing a particular base only depends on the previous n bases (n = 1 gives us the first-order chain).
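
As a rough sketch of what fitting a first-order Markov chain looks like, the snippet below counts adjacent base pairs and normalizes each row to estimate the probability of the next base given the current one. The function name and the toy sequence are assumptions for illustration, and this is the plain count-and-normalize estimate with no smoothing.

```python
from collections import Counter

ALPHABET = "ACGT"

def transition_probabilities(seq, alphabet=ALPHABET):
    """Estimate P(next base | current base) from adjacent-pair counts.

    Hypothetical helper for illustration; no pseudocounts are added.
    """
    pair_counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for current in alphabet:
        total = sum(pair_counts[(current, nxt)] for nxt in alphabet)
        # If `current` never has a successor, leave its row as zeros.
        probs[current] = {
            nxt: (pair_counts[(current, nxt)] / total if total else 0.0)
            for nxt in alphabet
        }
    return probs

toy = "ACGTACGGTACGA"  # made-up sequence, not real genomic data
P = transition_probabilities(toy)
```

Here `P["G"]["T"]` is the estimated probability that a T follows a G. An nth-order version would condition on the previous n bases instead of one, at the cost of 4^n rows to estimate.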

Another common stochastic process model used for DNA sequences is the Hidden Markov Model (often abbreviated HMM). In this case, we have a joint process: a symbol sequence that we do observe (in this case, the bases), and a 'state' sequence that we do not observe (but would like to infer). This is a common technique for identifying genes within a DNA sequence, where the hidden state transitions between 'gene' and 'not-gene'.
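
To illustrate the inference step, here is a minimal sketch of the Viterbi algorithm, a standard way to recover the most probable hidden-state path in an HMM. Every number below (start, transition, and emission probabilities) is invented for illustration and is not estimated from any genome. The toy model simply assumes the 'gene' state favors C and G and the 'not-gene' state favors A and T.

```python
import math

STATES = ("gene", "not-gene")

# All probabilities below are made up for illustration only.
START = {"gene": 0.5, "not-gene": 0.5}
TRANS = {
    "gene": {"gene": 0.9, "not-gene": 0.1},
    "not-gene": {"gene": 0.1, "not-gene": 0.9},
}
EMIT = {
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "not-gene": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq):
    """Return the most probable hidden-state path for `seq`."""
    # score[s] is the log-probability of the best path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when landing in state s.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATGCGCGCATAT")  # made-up sequence
```

The returned `path` has one state label per base. Working in log space avoids numerical underflow on long sequences, which matters for anything genome-sized.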

These ideas as applied to DNA sequences are old, dating back at least to the 1980s. See for example Gary Churchill's 'Stochastic models for heterogeneous DNA sequences' (the first Google result for 'stochastic process DNA sequence').

Bringing things back around to the original question, a time series is just a particular type of stochastic process, where the index is over time. A DNA sequence is a very similar object, where the index is over space. In both cases, we are usually interested in the dependencies between nearby (and distant) observations, in time for time series and in space for DNA sequences. As such, the tools of one field can often be applied to the other. However, the tools of time series analysis usually assume real-valued observations (the closing price of the stock market on consecutive days, the weight of a patient over some period of time, etc.), whereas the observations in a DNA sequence are discrete symbols from a fixed alphabet {A, C, G, T}. As such, we should be careful about any attempt to transplant tools directly from time series analysis to DNA sequence analysis.