25th March 2015
I’ve been taking a look at creating simple demonstrations of the types of data analytics that can be easily produced using a typical Big Data stack, created using our stack builder.
The ability to capture and analyse streaming data is a key requirement for many users. The Twitter API makes it easy to create streams of tweets and associated data in JSON, a format which is expected to provide the basic data currency of many IoT applications. It also lets the user set up a feed of data with keyword filters; we decided to set up a feed of tweets mentioning major brands and technology companies, hoping to link news stories and press releases to the levels of discussion on Twitter.
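As a rough illustration of what a keyword filter does to a stream of JSON tweets, here is a minimal local sketch in Python. It is not the Twitter API's own matching logic; it assumes each tweet arrives as one JSON object with a `text` field, and the keywords used in the usage example are hypothetical stand-ins for our brand list.

```python
import json
import re


def matches_keywords(tweet_json, keywords):
    """Return True if the tweet text contains any keyword as a whole word.

    Assumption: the tweet is a single JSON object with a "text" field.
    """
    try:
        tweet = json.loads(tweet_json)
    except json.JSONDecodeError:
        return False  # skip malformed lines rather than fail the whole feed
    if not isinstance(tweet, dict):
        return False
    text = str(tweet.get("text") or "").lower()
    # Split into word tokens so "pineapple" does not match "apple".
    words = set(re.findall(r"\w+", text))
    return any(keyword.lower() in words for keyword in keywords)
```

Calling `matches_keywords(line, ["apple", "google"])` on each incoming line keeps a tweet about a product launch, but it also keeps a tweet about eating an apple, since a keyword match says nothing about context.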
The first attempt was a failure. The only interesting patterns we were able to identify within the captured data were regular increases in mentions caused by spambots and one tweet by professional song and dance men The One Direction (which contained one of our keywords in a different context). This illustrates the amount of noise on Twitter, and the need to plan carefully exactly what you want to analyse (and when you want to do it).
For the second attempt, we booted a small Hadoop+Flume stack on-site at Analytics Engines, configured to capture five days’ worth of rugby-related tweets over the weekend of 14th March. This data was then added to our data library on cloud storage. We then deployed an analytics stack to AWS, which imported the data automatically. We built a user interface (using Shiny, a web-based interface to R) which allowed us to query (via Apache Hive or Spark SQL) and filter the Twitter data and produce the graphs shown here. The whole process took around two days of effort, most of which was spent designing the SQL query (I’m not very good at it) and coding up the Shiny interface (luckily Matthew is good at this).
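To give a feel for the shape of the query, here is a small sketch of a per-minute hashtag count. It is not the query we ran: it uses SQLite (via Python's standard library) as a stand-in for Hive or Spark SQL, and the `tweet_hashtags` table, its columns and the sample rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive/Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet_hashtags (created_at TEXT, hashtag TEXT)")

# Hypothetical sample rows: one row per (tweet timestamp, hashtag) pair.
sample_rows = [
    ("2015-03-14 14:30:05", "WALvIRE"),
    ("2015-03-14 14:30:40", "walvire"),
    ("2015-03-14 14:31:02", "WALvIRE"),
    ("2015-03-14 17:00:10", "ENGvSCO"),
    ("2015-03-14 17:00:20", "rugby"),
]
conn.executemany("INSERT INTO tweet_hashtags VALUES (?, ?)", sample_rows)

# Truncate each timestamp to the minute, then count mentions per hashtag.
PER_MINUTE_QUERY = """
SELECT strftime('%Y-%m-%d %H:%M', created_at) AS minute,
       lower(hashtag) AS tag,
       COUNT(*) AS mentions
FROM tweet_hashtags
WHERE lower(hashtag) IN ('walvire', 'engvsco')
GROUP BY minute, tag
ORDER BY minute, tag
"""

results = conn.execute(PER_MINUTE_QUERY).fetchall()
for minute, tag, mentions in results:
    print(minute, tag, mentions)
```

The date functions differ between SQLite, Hive and Spark SQL, so the `strftime` call would need translating, but the filter, group and count structure carries over.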
We captured a total of 24 million tweets over the five days, which took around 800 seconds to query with Hive on an 18-node cluster (700 seconds using Spark). The vast majority of this time was spent parsing the JSON data.
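The parsing step looks roughly like the sketch below, which turns one JSON tweet into (minute, hashtag) rows. This is a plain Python illustration rather than the code that ran on the cluster. It assumes the classic Twitter JSON layout, with a `created_at` string such as `Sat Mar 14 14:30:05 +0000 2015` and hashtags under `entities.hashtags[].text`; the function name is my own.

```python
import json
from datetime import datetime

# Assumed timestamp layout, e.g. "Sat Mar 14 14:30:05 +0000 2015".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_rows(line):
    """Parse one JSON tweet into (minute, hashtag) rows; [] if unusable."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return []  # malformed line
    if not isinstance(tweet, dict) or "created_at" not in tweet:
        return []  # not a tweet (the stream may carry other message types)
    try:
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    except (TypeError, ValueError):
        return []  # timestamp not in the assumed layout
    minute = created.strftime("%Y-%m-%d %H:%M")
    hashtags = (tweet.get("entities") or {}).get("hashtags") or []
    return [(minute, tag["text"].lower()) for tag in hashtags if "text" in tag]
```

Every tweet has to go through the full JSON decode even though only two fields are used, which gives some idea of why parsing dominates the query time.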
The graph shows the per-minute counts of the relevant hashtags for both Wales v Ireland (#WALvIRE – blue/green line – kick-off 2.30pm) and England v Scotland (#ENGvSCO – red line – kick-off 5pm).
Saturday 14th March: Wales v Ireland followed by England v Scotland
Both games have lots of interesting-looking peaks – I enjoyed the Wales/Ireland match more, so we’re going to take a look at it in more detail, relating the peaks in Twitter activity back to in-game events. The events in the image below are colour coded:
Wales v Ireland hashtag over the course of the match
The level of engagement on Twitter was much greater in the second half of the game, presumably because that half was so exciting. It is interesting that such a large peak occurred at the end of the game, probably helped by a controversial penalty decision in the last minute.
Images were taken from: