Silver at the Oscars

As part of the Directed Reading Program I'm doing with a student at Maryland, we've decided to dig into Nate Silver's Oscar Predictions for this year.

'Fortunately' for us, Silver only gives us a glimpse of his methodology. He also doesn't supply us with the data he used. I've compiled that data here. In short, here is his approach:

The gory details: I weight each award based on the square of its historical success rate, and then double the score for awards whose voting memberships overlap significantly with the Academy.

I don't know how a weighted average counts as 'gory,' but I guess Silver has a different audience in mind. Let's unpack this methodology, in the hopes that "by relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems."¹

Let \(O^{(j)}\) be the Oscar winner in the \(j^{\text{th}}\) year, \(j = 1, \ldots, 25\). Let \(C_{i}^{(j)}\) be the award winner from the \(i^{\text{th}}\) ceremony in that year, \(i = 1, \ldots, 16\)². Let \(S^{(j)}\) be the set of candidate Oscar winners in the \(j^{\text{th}}\) year. For example, this year, which we index as \(j = 26\), we had \[S^{(26)} = \{\text{Amour, Argo, Beasts of the Southern Wild, Django Unchained,}\]

\[\text{Les Miserables, Life of Pi, Lincoln, Silver Linings Playbook, Zero Dark Thirty} \}.\] Thus, we have nine films to predict from, and \(|S^{(26)}| = 9.\) From the description above, Silver wants to construct a function from \(S^{(26)}\) to the reals such that

\[ f(s) = \sum_{i = 1}^{16} \beta_{i} I[C_{i}^{(26)} = s], s \in S^{(26)}.\]

Here \(I[\cdot]\) is the indicator function, which is 1 if its argument is true, and 0 otherwise. For example, if \(s\) = Argo and \(C_{1}^{(26)} =\) Argo, then \(I[C_{1}^{(26)} = s] = 1.\) But if \(s\) = Argo and \(C_{2}^{(26)} =\) Amour, then \(I[C_{2}^{(26)} = s] = 0\). We then choose the film \(s\) that gets the highest weighted vote,

\[ \hat{s} = \arg \max_{s \in S^{(26)}} f(s).\]
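To make the notation concrete, here is a minimal Python sketch of the weighted vote \(f(s)\) and the arg max over candidates. The award names, winners, and weights below are made up for illustration; they are not Silver's data, and the function name `weighted_vote` is my own.

```python
def weighted_vote(candidates, award_winners, weights):
    """Compute f(s) = sum_i beta_i * I[C_i = s] for each candidate s.

    candidates:    iterable of films (the set S)
    award_winners: dict mapping award name -> winning film (the C_i)
    weights:       dict mapping award name -> weight (the beta_i)
    Returns (scores, best), where best is the arg max of the scores.
    """
    scores = {film: 0.0 for film in candidates}
    for award, film in award_winners.items():
        # The indicator is 1 only when this award went to a candidate film.
        if film in scores:
            scores[film] += weights[award]
    # Ties are broken by the order of `candidates` (first maximum wins).
    best = max(scores, key=scores.get)
    return scores, best


# Hypothetical inputs, for illustration only.
candidates = ["Amour", "Argo", "Lincoln"]
award_winners = {"award_1": "Argo", "award_2": "Amour", "award_3": "Argo"}
weights = {"award_1": 0.5, "award_2": 0.8, "award_3": 0.4}

scores, best = weighted_vote(candidates, award_winners, weights)
print(scores)
print(best)
```

Films that win none of the awards keep a score of zero, which matches the sum of indicators in the formula.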

Let's consider a few choices of \(\beta_{i}\) and see what sort of prediction they give us. If we take \(\beta_{i} = 1\) for all \(i\), our predictor reduces to

\[f(s) = \sum_{i = 1}^{16} I[C_{i}^{(26)} = s], s \in S^{(26)},\]

which, after arg max-ing, is simply the majority vote. That is, we'll choose the film that got the most awards from the non-Oscar awards that year. This seems like a reasonable enough predictor in the situation where we have no other information about how the non-Oscar awards relate to the Oscar awarded in a given year. But we do have such information. It's summarized in Nate Silver's table and captured in this tab-delimited file. We can (hopefully) use the past performance of the non-Oscar awards and the Oscars to predict how the Oscars will turn out this year. Loosely, we can try to pick up on correlations between the non-Oscar awards and the Oscars that might be predictively useful. This is exactly what Silver ends up doing, in an ad hoc fashion.

Using the historical data, Silver computes, for each award \(i\), the fraction of years that award and the Oscars agreed on their winning films. Using our handy-dandy indicator functions again, we can write this value \(\hat{p}_{i}\) as

\[\hat{p}_{i} = \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} I[O^{(j)} = C_{i}^{(j)}]\]

where \(n_{i}\) is the number of years we have data for award ceremony \(i\), since not all of the award ceremonies go back to 1987. In fact, only ten of the fourteen I've included do. With the \(\hat{p}_{i}\) values in hand for all of the sixteen non-Oscar awards, Silver computes his new weighted vote³ as

\[ f(s) = \sum_{i = 1}^{16} \hat{p}_{i}^2 I[C_{i}^{(26)} = s], s \in S^{(26)}.\]
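The agreement rates \(\hat{p}_{i}\) that feed this vote are straightforward to compute. Below is a rough Python sketch, assuming the history is stored as one dictionary per award mapping year to winner. That layout, the award names, and the film names are all hypothetical, not the format of the actual data file. Each award is averaged only over the years for which it has data, which plays the role of \(n_{i}\).

```python
def agreement_rate(oscar_by_year, award_by_year):
    """Fraction of years in which an award's winner matched the Oscar winner.

    Only years present in both dictionaries are counted, so the
    denominator is n_i, the number of years of data for this award.
    """
    years = [year for year in award_by_year if year in oscar_by_year]
    if not years:
        raise ValueError("no overlapping years between award and Oscars")
    hits = sum(award_by_year[year] == oscar_by_year[year] for year in years)
    return hits / len(years)


# Made-up history, for illustration only.
oscars = {2009: "Film A", 2010: "Film B", 2011: "Film C", 2012: "Film D"}
history = {
    # Four years of data, agrees with the Oscars in three of them.
    "guild_award": {2009: "Film A", 2010: "Film B", 2011: "Film X", 2012: "Film D"},
    # A younger award with only two years of data, agrees in one.
    "critics_award": {2011: "Film C", 2012: "Film Y"},
}

p_hat = {name: agreement_rate(oscars, wins) for name, wins in history.items()}
# Squared success rates, as in Silver's description of his weights.
weights = {name: p ** 2 for name, p in p_hat.items()}
print(p_hat)
print(weights)
```

The resulting `weights` dictionary can then be used as the \(\beta_{i}\) in the weighted vote.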

How does this compare to the 'majority vote' predictor above? Now we account for both the number of votes given to each of the candidate films and how well those votes have matched up with the Oscars in the past. In other words, if a lot of the non-Oscar awards vote for a film, but those awards haven't matched the Oscar winner historically, we'll discount their votes.
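A toy example shows how the two predictors can disagree. In the Python sketch below, three awards with a poor track record back one film and two awards with a good track record back another. All of the names and success rates are invented for illustration.

```python
def vote(candidates, award_winners, weights):
    """Return (pick, scores) for the weighted vote over the candidates."""
    scores = {film: 0.0 for film in candidates}
    for award, film in award_winners.items():
        if film in scores:
            scores[film] += weights[award]
    return max(scores, key=scores.get), scores


# Invented awards, winners, and historical success rates.
candidates = ["Film X", "Film Y"]
award_winners = {
    "a1": "Film X", "a2": "Film X", "a3": "Film X",  # three weak predictors
    "a4": "Film Y", "a5": "Film Y",                  # two strong predictors
}
p_hat = {"a1": 0.4, "a2": 0.4, "a3": 0.4, "a4": 0.8, "a5": 0.8}

# Majority vote: every award gets weight 1.
uniform = {award: 1.0 for award in award_winners}
# Squared success rates: awards that rarely match the Oscars count for less.
squared = {award: p ** 2 for award, p in p_hat.items()}

majority_pick, majority_scores = vote(candidates, award_winners, uniform)
squared_pick, squared_scores = vote(candidates, award_winners, squared)
print("majority vote picks:", majority_pick)
print("squared weights pick:", squared_pick)
```

Here the majority vote goes to Film X, three votes to two, while the squared weights favor Film Y, since three votes at roughly 0.16 each add up to less than two votes at roughly 0.64 each.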

So far we've seen two possible choices for the weights \(\beta_{i}\). The first choice, namely \(\beta_{i} = 1\) for all \(i\), works well in the situation where we have no historical information. The second choice, namely \(\beta_{i} = \hat{p}_{i}^{2}\), seems reasonable enough when all we know is whether a non-Oscar award agreed or didn't with the Oscars in the past. But why the square? And can we choose a different weighting that gets us a more standard predictor from the statistical learning literature⁴?

Stay tuned to find out!

¹ A new favorite quote by Alfred Whitehead.
² Note that Nate Silver uses 16 awards instead of my 14. I removed the American Cinema Editors Award for Best Comedy or Musical and the Golden Globe for Best Musical or Comedy, since they didn't seem to have much to do with the Best Film, which is usually a Drama.
³ Well, not quite yet. Silver also doubles the weight for 'insider' awards. These are awards that have a considerable overlap with the Oscars in terms of their members. So, add that doubling to the model to get back Silver's complete predictive model.
⁴ My current gut feeling: this feels a lot like logistic regression. We're basically doing a multi-class classification problem. But the fact that the set of films changes every year complicates things a bit.