Tweets, Metadata, and Other Controversial Topics — And a New Paper
With all of the recent activity surrounding Edward Snowden, 'metadata' has been in the news. For those who have been living under a rock for the past several months: metadata, as the name implies, is data about data. For example, if you make a phone call to a friend, the conversation itself is the data, and the metadata is information about the call: the two phones involved, how long the call lasted, and so on.
I didn't realize until the Snowden story broke that I've recently been very much in the business of metadata (though hopefully in less nefarious ways). I recently presented a paper at SocialCom 2013 that covers this work. The main paper can be found on the arXiv.
I was moved to write about my research after reading this piece from Rolling Stone about Dzhokhar Tsarnaev (or Jahar, as his friends called him), the surviving brother implicated in the Boston Marathon bombing this past April. In the article, Janet Reitman references Jahar's Twitter account several times. It turns out his Twitter account is still online. In fact, he tweeted up until the day of the bombing, and then beyond it.
As an exercise in learning tweepy, a Python library for interfacing with the Twitter API, I pulled down about a year of tweets from Jahar. Most of them seem typical of an American teenager:
2012-04-11 03:47:20: dam im one lazy ass dude #whatsthemotive
2012-04-12 07:00:30: protein milk shake yeaa we getting #big out here
Some point to Jahar's unique past:
2012-04-29 22:00:12: proud to be from #chechnya
2012-08-03 05:45:10: chechen dudes holding down russia with all of their gold medals
And some are much more haunting:
2013-04-16 00:04:50: Ain't no love in the heart of the city, stay safe people
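The timestamps on these tweets follow a simple, fixed format, so Python's standard library can parse them directly. A minimal sketch (the helper name is mine; the lines are copied from the examples above):

```python
from datetime import datetime

# Example tweet lines, in the "timestamp: text" format shown above.
lines = [
    "2012-04-11 03:47:20: dam im one lazy ass dude #whatsthemotive",
    "2013-04-16 00:04:50: Ain't no love in the heart of the city, stay safe people",
]

def parse_tweet_line(line):
    """Split a line into a datetime and the tweet text."""
    stamp, text = line.split(": ", 1)  # first ": " ends the timestamp
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), text

for when, text in map(parse_tweet_line, lines):
    print(when, "->", text[:30])
```

Once parsed, everything but `when` gets thrown away in what follows.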
All of this is certainly informative about Jahar as a person. In my recent work, however, I throw all of this extra information out and focus only on the timestamp of each tweet. From this perspective, a person tweeting is no different from a neuron firing, so we can visualize the tweets using a rastergram[1]:
Each row in the plot corresponds to a single day, and each row plays out between 7am and 10pm[2]. In a particular row, a vertical bar occurs if Jahar tweeted during that second.
A rastergram is a useful tool for exploring point processes, which in this context means a collection of events occurring at (possibly random) times. We've already seen the simplest example of a point process on this blog: the Poisson process. For a Twitter user, this would mean that the person is, or appears to be[3], tweeting at random: at each second, they flip a coin with some bias and tweet only if the coin comes up '1'[4].
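That coin-flipping description is a Bernoulli process, the discrete-time analogue of the Poisson process. Simulating one day of such a 'random tweeter' takes a few lines (the bias value is arbitrary, chosen only for illustration):

```python
import random

random.seed(42)   # for reproducibility

p = 0.001         # probability of tweeting in any given second
n = 54_000        # number of seconds between 7am and 10pm

# One simulated day: at each second, flip the biased coin.
day = [1 if random.random() < p else 0 for _ in range(n)]

print(sum(day))   # roughly p * n = 54 tweets expected
```

Rows of a rastergram generated this way look like featureless static: no daily rhythm, no bursts, nothing to learn from.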
If this turns out to be the case, we can't do much in terms of prediction for that user. In particular, their past behavior tells us nothing about how they'll behave in the future. Instead of trying to build a complicated model of their dynamics, we're better off guessing they'll do what they've already done: we count up the number of times they tweeted and the number of times they didn't, and predict whichever wins the majority vote. Not an interesting predictor, but the best possible one in this case.
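The majority-vote baseline fits in a couple of lines. For a user who tweets in a fraction p of seconds, it predicts 'no tweet' whenever p < 1/2, and its accuracy is max(p, 1 - p). A sketch with a made-up binary history:

```python
def majority_vote_predictor(history):
    """Predict the symbol that occurred most often in the past."""
    ones = sum(history)
    return 1 if ones > len(history) - ones else 0

# Toy history: the user tweeted in 3 of 10 seconds.
history = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
guess = majority_vote_predictor(history)
accuracy = max(sum(history), len(history) - sum(history)) / len(history)
print(guess, accuracy)  # → 0 0.7
```

Since people tweet in only a tiny fraction of seconds, this baseline almost always predicts 'no tweet', and any useful model has to beat that.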
If the user does have a discernible pattern in their behavior, then we can learn this pattern by building a model of their past dynamics. We built two types of models: an \(\epsilon\)-machine / causal state model and an echo state network model. The first is inspired by results from probability theory about the optimal predictive representation of conditionally stationary stochastic processes. The second comes out of ideas from artificial neural networks on how to deal with sequential data.
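The details of both models are in the paper, but the flavor of the approach can be conveyed with a much simpler stand-in: condition on the last k symbols of the binary series and predict the most likely next symbol. This is an order-k Markov predictor, not a true \(\epsilon\)-machine or echo state network, and the 'user' below is an invented, perfectly periodic one:

```python
from collections import Counter, defaultdict

def fit_order_k(sequence, k):
    """Count next-symbol frequencies conditioned on the previous k symbols."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        past = tuple(sequence[i - k:i])
        counts[past][sequence[i]] += 1
    return counts

def predict(counts, past):
    """Most frequent successor of this length-k history (default 0 if unseen)."""
    successors = counts.get(tuple(past))
    return successors.most_common(1)[0][0] if successors else 0

# A perfectly periodic 'user' who tweets every third second.
seq = [1, 0, 0] * 50
model = fit_order_k(seq, k=2)
print(predict(model, [0, 0]))  # → 1: after two quiet seconds, a tweet
print(predict(model, [1, 0]))  # → 0
```

Where this stand-in memorizes fixed-length histories, the causal state construction groups histories that predict the same future, and the echo state network compresses the past into a recurrent hidden state; both handle far messier dynamics than this toy.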
I'll leave the details for the paper, but suffice it to say that by building models of user behavior, we were able to do much better than the simple majority-vote predictor. This is perhaps not surprising: people don't tweet at random. What was surprising was (1) how similar the results from the causal state models and the echo state networks were and (2) how well we did without incorporating any explicit social information in the models[5]. We're still not sure about the first point, but we have explored some different ideas. On the second point, we're moving forward with incorporating inputs into our models in the natural ways, and should hopefully begin to have results towards the end of this semester.
This paper was a lot of fun to work on. It represents the culmination of the first project where I helped choose the direction, did a lot of the groundwork, and wrote a lot of the paper text. Excitement for these sorts of things (I published!) will presumably diminish as I publish more. But I'll take this chance to bask in a (minor) victory.
[1] I first learned about rastergrams in a computational neuroscience course I took with Dan Butts. They're a handy way to visualize point processes across time and across trials. Which, when we remember that the spikes of neurons can be modeled as point processes, makes their appearance in neuroscience quite clear.

[2] Why these hours? Because these are the hours I could imagine myself being active on Twitter.

[3] Why 'appears to be'? Just because something looks random to our eyes (or our methods) doesn't mean the process is random in any statistical sense. As a simple example, imagine a black box (transducer) that takes as its input a random stream of bits, and outputs the stream with a time delay. Given the input to the black box, it behaves completely deterministically. But our lack of knowledge about the system makes the black box look random.

[5] We treated each user as a process with self-feedback: in essence, an autoregressive model.