The Building Blocks of Probabilistic Models
I had the realization this past semester that conditional distributions are the building blocks of probability models. This won't come as a surprise to anyone who has studied graphical models1. Graphical models are a way of capturing conditional independence relationships, and they force you to think about how the joint factors in terms of the conditional distributions.
Of course, a joint distribution \(f(x_{1}, \ldots, x_{n})\) always factors as \[ \prod_{i = 1}^{n} f(x_{i} | x_{i-1}, \ldots, x_{1}) = f(x_{n} | x_{n-1}, \ldots, x_{1}) \cdot \ldots \cdot f(x_{1}).\] This is called the chain rule, which immediately makes me think of the chain rule from calculus. They're completely unrelated, of course. This version is a 'chain' rule because you keep factoring the marginal distributions using the definitions of conditional probability until you reach the final marginal \(f(x_{1})\).
We didn't have to apply the chain rule this way. We could have also correctly written \[ f(x_{1}, \ldots, x_{n}) = f(x_{1} | x_{2}, \ldots, x_{n}) \cdot \ldots \cdot f(x_{n}).\] Often, there isn't anything telling us how we should do the factoring. We usually do so in a way that makes our lives easier. In the case of graphical models, the graph tells us how to factor most efficiently.
So, we see that conditional distributions give us a way to build up joint distributions. Thus the sense that they are the 'building blocks' of probabilistic models. But what does this matter?
Here's an example. A while back (about this time, a year ago), I was working on a project having to do with the methylation of DNA in cancerous and non-cancerous tissues. Methylation tends to occur on cysteines occurring in CpGs along the DNA molecule. We abstracted the problem to the point that we asked whether adjacent CpGs tended to methylate together.
To answer this question, for a given subject, we had many, many reads from the subjects DNA, and wanted to know if there was a high rate of variability at a given site within a single subject. Thus, for two adjacent CpG sites, we really have a bunch of independent samples \((M^{1}_{1}, M^{2}_{1}), (M^{1}_{2}, M^{2}_{2}), \ldots, (M^{1}_{n}, M^{2}_{n})\) where \(M^{j}\) is a Bernoulli random variable taking a value 1 if site \(j\) is methylated and 0 otherwise and \((M^{1}_{i}, M^{2}_{i})\) comes from a single read \(i\).
At the time, we had trouble figuring out what the joint distribution of \((M^{1}, M^{2})\) should be. I remember a frantic phone call with my partner where I realized that using correlation or even mutual information wouldn't work to figure out of adjacent sites tended to methylate together or not. Thinking back on it now, I'm pretty sure my fears were nonsense. Let's see why.
We want to build up a model for \((M^{1}, M^{2})\). That means we have to take the joint distribution \(f(m_{1}, m_{2})\) and factor it. In this very simple case, it doesn't really matter how we do that. Let's use \[f(m_{1}, m_{2}) = f(m_{1}) f(m_{2} | m_{1}).\] Okay, we need to specify two pieces, then. Both \(f(m_{1})\) and \(f(m_{2} | m_{1})\). The first one is easy. We're modeling the random variable \(M_{1}\) as Bernoulli, so we must have that \(M_{1}\) has probability mass function \[ f(m_{1}) = p_{1}^{m_{1}}(1 - p_{1})^{m_{1}},\] where \(m_{1}\) takes either the value 1 (methylated) or 0 (unmethylated).
Now comes the more subtle part. We need to specify the conditional distribution of \(M_{2}\), given that we observe a particular value of \(M_{1}\). This captures any sense of structure between \(M_{1}\) and \(M_{2}\). For example, if they were independent, then \(f(m_{2} | m_{1}) = f(m_{2})\), and we'd just have to specify \(M_{2}\) as a Bernoulli with success rate independent of \(M_{1}\). But since we don't want to assume that, we should incorporate the dependence somehow. The simplest way is by using the logistic function. This logistic function is nice, because it maps any real number to the interval \([0, 1]\), where probabilities have to live. We know that \(M_{2} | M_{1} = m_{1}\) still has to be Bernoulli, since it only takes two values. Thus, we'll say that \[M_{2} | M_{1} = m_{1} \sim \text{Bernoulli}\left( \frac{1}{1 + e^{-(\beta_{0} + \beta_{1} m_{1})}} \right).\] \(\beta_{0}\) and \(\beta_{1}\) control how correlated \(M_{1}\) and \(M_{2}\) are. We can make it so that when \(M_{1} = 1\), \(M_{2}\) is always \(1\) by taking \(\beta_{1}\) large and positive. Similarly, we can make it so that when \(M_{1} = 1\), \(M_{2}\) is always 0 by taking \(\beta_{1}\) to be a very large negative number2.
Now we can write down the joint distribution of \((M_{1}, M_{2})\) by patching together the marginal of \(M_{1}\) and the conditional of \(M_{2}\) given \(M_{1}\), and compute anything we want. I used this model recently to validate some mutual information code. If we take \(p_{1} = 1/2\), \(\beta_{0} = 0\), and \(\beta_{1} = -1\), we get that the mutual information between \(M_{1}\) and \(M_{2}\) is about 0.04 and the correlation between them is about 0.057. Neither of which, of course, is 0, as I mistakenly believed last year.
- Don't even get me started on those who call directed graphical models 'Bayesian networks'. Wikipedia, I'm ashamed of you.↩ 
- This should look a lot like the model behind logistic regression. Basically, it is.↩