Thursday, 15 December 2011

Twitter Streaming API - Almost Useful

Twitter's Streaming API is a splendid idea. It gives developers access to a good splodge of Tweets and lets us filter them in different ways.

However, it has a few flaws.

For Twitter, the benefit of a Streaming API is probably one of scalability. Instead of us using the old REST API to ask for specific data and causing tens of thousands of data lookups, they simply hand us the tail end of their own data stream once it has been used in house and is already halfway across the back garden. All they have to do is let us filter the stream a bit to make it more relevant to our needs, and put an absolute cap on the throughput (about 1% for most of us).

This looks good. 1% is enough for most development needs and streams down your connection like a low-bandwidth radio station. I don't really know what the actual bandwidth or download volume is, but it's not much. Once we've developed our new and wonderful website, we can ask, or possibly pay, Twitter to turn up the pressure a bit.

So, now let's look at the filters.

There are several ways that the stream can be filtered:
  • follow - filter by userid
  • track - filter by keyword
  • location - filter by geographic location
  • retweets - just the retweets ma'am
  • links - only tweets containing a link
  • random - I think they just mean unfiltered
On the face of it, this is pretty useful. You could filter the stream by the location of your home town to get the gist of what is important to your townsfolk. You could filter the stream by the keyword 'Elvis' to see who has spotted Elvis lately and plot the results on a map.

My own first idea was inspired by the M5 motorway accident just a few miles from where I live. I was astounded that, even in this day and age, the scale of the incident was only uncovered somewhat slowly. Surely the quantity and content of the tweets from people who were NOT in the incident itself would help gauge its scale? So what I wanted to do was:
  1. Listen to what people Tweet at known traffic jam locations.
  2. Identify some fingerprint of common words, maybe "traffic, jam, standstill, miles" or whatever.
  3. Look for clusters of these words near to motorways.
  4. Plot the clusters based on the location of the phones that made the tweets.
So I get an updating list of current traffic incidents from the Traffic England RSS feed and start listening within 5 km of the stated location, using a bounding box centred on that location. The first thing I notice is that some of the tweets are in the bounding box, some are around it and some, quite a few in fact, are far away, sometimes 200 km away or more!
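
For anyone wanting to try the same thing, here is a rough Python sketch of how a box around an incident could be worked out. It treats the Earth as a sphere, and the function names and the example coordinates are purely illustrative. As far as I can tell the API wants each box as longitude,latitude pairs with the south-west corner first, but treat that ordering as an assumption to check against the documentation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def bounding_box(lat, lon, half_side_km=5.0):
    """Return (west, south, east, north) in degrees for a box centred on (lat, lon)."""
    # Degrees of latitude covered by half_side_km (the same everywhere on a sphere).
    dlat = math.degrees(half_side_km / EARTH_RADIUS_KM)
    # Degrees of longitude shrink with latitude, so divide by cos(lat).
    dlon = math.degrees(half_side_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


def as_locations_param(box):
    """Format a box as 'west,south,east,north' (assumed parameter layout)."""
    return ",".join("%.4f" % value for value in box)


# Illustrative coordinates only, not a real incident location.
box = bounding_box(51.0, -3.0)
print(as_locations_param(box))
```

Note that the box is about 10 km on each side, so its corners are a little over 7 km from the centre rather than 5.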

I don't know how Twitter does the filtering, but evidently it's based on something fairly broad-brush. I can live with that, maybe: all I have to do is check the Tweet's geolocation, which is added if you tweet from most modern phones. I was expecting most of the useful tweets to come from a mobile anyway, so that would work if I can just get used to maybe 2% of the tweets actually being in the bounding box. 2% of 1% is, after all, only 0.02% of all Tweets, or 1 Tweet in 5000.
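
The client-side check might look something like the sketch below. It assumes each tweet has been parsed into a dict with a GeoJSON-style "coordinates" field holding [longitude, latitude]; that field layout is my assumption about the payload, and `in_box` is a made-up helper name. The last lines just confirm the arithmetic.

```python
def in_box(tweet, box):
    """True if the tweet carries a point that falls inside (west, south, east, north)."""
    west, south, east, north = box
    point = tweet.get("coordinates")  # assumed GeoJSON point, or None if not geotagged
    if not point:
        return False
    lon, lat = point["coordinates"]  # GeoJSON order is longitude first
    return west <= lon <= east and south <= lat <= north


# 2% of the 1% sample, expressed as a fraction of all Tweets.
FRACTION_OF_ALL_TWEETS = 0.02 * 0.01
print("about 1 Tweet in %d" % round(1 / FRACTION_OF_ALL_TWEETS))
```

Tweets with no coordinates at all are simply dropped here, which suits this idea since only phone-located tweets are of interest.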

So what happens if I assume that the word "traffic" will occur in the most useful Tweets? This is either bad science or common sense data filtering depending on how you look at it.

Alas, it appears that the Streaming API does not allow you to filter by location AND keyword! All you can do is an OR filter, so I can filter the stream to include certain areas OR certain keywords, but not certain keywords within a certain location.
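
The obvious workaround is to ask the API for one half of the AND and do the other half yourself. Here is a minimal sketch, assuming the stream has already been filtered by location on Twitter's side and each tweet arrives as a dict with a "text" field. The keyword set and the function name are only illustrative.

```python
import re

# Illustrative fingerprint words, taken from the idea above.
KEYWORDS = {"traffic", "jam", "standstill", "miles"}


def keyword_filter(tweets, keywords=KEYWORDS):
    """Yield only the tweets whose text contains at least one keyword as a whole word."""
    for tweet in tweets:
        # Lower-case and split into words so "TRAFFIC" matches but "trafficking" does not.
        words = set(re.findall(r"[a-z']+", tweet.get("text", "").lower()))
        if not words.isdisjoint(keywords):
            yield tweet


# Stand-in for tweets arriving from a location-filtered stream.
sample = [
    {"text": "Stuck in TRAFFIC on the M5"},
    {"text": "Lovely day for a walk"},
]
for tweet in keyword_filter(sample):
    print(tweet["text"])
```

The catch is that the 1% cap is applied before your own filter runs, so the local half of the AND can only throw tweets away; it cannot recover anything the cap has already dropped.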

To me this just renders the API all but useless, but no doubt you lot are much smarter than I and will dazzle me with your great ideas.

Please let me know.