How to do real-time, high-volume data analysis on Twitter
There has been an increasing amount of chatter about the idea of social TV. I have spoken and written about it before, arguing that social TV, like most technology triggers, would go through the typical curve of innovation, and that in the end a few dominant players would emerge.
A key requirement for becoming a dominant player will be measurability. Consistent measurability requires, as a starting point, accurate data collection.
When we began looking at Twitter traffic around TV shows we only wanted to count the number of tweets per show, to measure their relative popularity.
The first approach: Using the Twitter Search API
We initially achieved this by sampling the Twitter search API using a short, manually compiled list of hashtags related to each TV show. It quickly became apparent that this was NOT the best solution.
The Twitter search API returns at most 1500 results per call. This limit meant that we had to approximate a tweet rate from the timestamps of the tweets returned. The accuracy of the tweet volume per show was therefore directly tied to the sampling rate we used, because the 1500-result limit is easily reached!
This is like measuring the speed of a car at a speed trap. Cars race down the road only to slow down briefly for a speed camera. Assuming the speed at the speed trap is representative of the whole ride is probably wrong. When measuring tweet rates we occasionally caught very high spikes. Because we were sampling only every 15 minutes, our estimates adjusted to those spikes too late, and the resulting interpolation was rough.
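To make the problem concrete, here is a minimal Python sketch of that kind of rate estimate. It is not our original code: the function names, the assumption that each call yields a list of tweet timestamps, and the 15-minute extrapolation are illustrative only.

```python
from datetime import datetime, timedelta

RESULT_CAP = 1500        # per-call result limit described above
SAMPLE_INTERVAL = 900    # seconds between samples (15 minutes)


def estimate_rate(timestamps, cap=RESULT_CAP):
    """Estimate tweets per second from the timestamps of one search call.

    Returns (rate, capped). `capped` is True when the call hit the result
    limit, meaning the sample covers only a slice of the interval.
    """
    if len(timestamps) < 2:
        return 0.0, False
    span = (max(timestamps) - min(timestamps)).total_seconds()
    span = max(span, 1.0)  # avoid dividing by zero on same-second bursts
    return len(timestamps) / span, len(timestamps) >= cap


def extrapolate(rate, interval_seconds=SAMPLE_INTERVAL):
    """Naively assume the sampled rate held for the whole interval."""
    return round(rate * interval_seconds)


if __name__ == "__main__":
    start = datetime(2012, 1, 1, 20, 0, 0)
    # Hypothetical spike: 1500 tweets arriving within roughly 30 seconds.
    spike = [start + timedelta(seconds=i * 0.02) for i in range(1500)]
    rate, capped = estimate_rate(spike)
    print(f"rate={rate:.1f}/s capped={capped} "
          f"extrapolated={extrapolate(rate)}")
```

If the spike lasted only a minute, extrapolating its rate over the full 15 minutes overstates the volume many times over; if the sample landed in a quiet moment instead, the spike is missed entirely. Either way the count depends on when the sample was taken.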
The second approach: Using bolts
Our solution was to use multiple bolts (small programs) to count the tweets in parallel. To do this we used Storm, one of Twitter's own projects, and a healthy dose of AWS services (DynamoDB and EC2). Using Storm, we deploy a topology of workers that count and record tweets directly from the Firehose. The workers are distributed over a flexible number of cloud instances and can react to large volumes in real time. The net effect is that we count each tweet, even in peak conditions, as opposed to relying on estimated tweet rates.
Using bolts is the first step toward reliable measurability for social TV data on Twitter. However, understanding how much people tweet about a show is still a very rough metric.
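The idea of counting in parallel can be sketched in plain Python. This is a single-process simulation, not Storm's API and not our production topology: the hashtag table, the `CountBolt` class and the routing function are all hypothetical names chosen for illustration.

```python
import zlib
from collections import Counter

# Hypothetical hashtag-to-show mapping; a real list would be much longer.
SHOW_HASHTAGS = {"#showa": "Show A", "#showb": "Show B"}


def shows_in(text):
    """Return the set of shows whose hashtags appear in a tweet."""
    return {SHOW_HASHTAGS[w] for w in text.lower().split()
            if w in SHOW_HASHTAGS}


class CountBolt:
    """Stand-in for one counting worker holding its own partial counts."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, show):
        self.counts[show] += 1


def route(show, n_bolts):
    """Send the same show to the same bolt, using a stable hash."""
    return zlib.crc32(show.encode("utf-8")) % n_bolts


def run_topology(tweets, n_bolts=4):
    """Feed tweets through the simulated bolts and merge their counts."""
    bolts = [CountBolt() for _ in range(n_bolts)]
    for text in tweets:
        for show in shows_in(text):
            bolts[route(show, n_bolts)].execute(show)
    total = Counter()
    for bolt in bolts:
        total.update(bolt.counts)
    return total


if __name__ == "__main__":
    stream = ["Loving #ShowA tonight", "#showb is back",
              "#showa #showb crossover", "no tags here"]
    print(run_topology(stream))
```

Routing by show means each worker owns the counts for its shows, so workers never contend over the same counter and more of them can be added as volume grows. The matching here is deliberately naive (a hashtag followed by punctuation would be missed).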
With the ability to distribute computation in a rigorous way (the Storm project guarantees at-least-once message delivery), we are now interested in looking at the tweet content in a lot more depth: What was said? Were there positive or negative connotations? Is there a clear demographic split?
The best way to make decisions with this data remains to be seen. However, I could easily see it changing the way TV is made, or aiding business-critical decisions.
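One consequence of at-least-once delivery is that the same tweet can occasionally be processed more than once, so a counter has to tolerate replays. The following Python sketch shows one way to do that, assuming every tweet carries a unique ID. The class and method names are hypothetical, and this is not a description of how our counts are actually stored.

```python
class IdempotentCounter:
    """Count each (tweet, show) pair once, even if it is delivered twice."""

    def __init__(self):
        self.seen = set()   # (tweet_id, show) pairs already counted
        self.counts = {}    # show -> number of distinct tweets

    def record(self, tweet_id, show):
        """Return True if the tweet was counted, False if it was a replay."""
        key = (tweet_id, show)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.counts[show] = self.counts.get(show, 0) + 1
        return True


if __name__ == "__main__":
    counter = IdempotentCounter()
    # The second delivery of tweet 101 simulates a replay after a failure.
    for tweet_id, show in [(101, "Show A"), (102, "Show A"), (101, "Show A")]:
        counter.record(tweet_id, show)
    print(counter.counts)
```

The in-memory set grows without bound, so a long-running system would need to expire old IDs or push the "only if not seen" check into the data store.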
Guillermo Christen, Head of Product Development – Content Discovery