DIY Data Science, Part 4: G is for Graph Analysis


It’s been tricky keeping this series up! I’ve started learning about a lot of things at the same time and it’s been quite difficult to neatly isolate one bit to write about. A few weeks ago, however, I decided my social network analysis skills (or, as I like to call them here, my graph analysis skills – poor G was an unpopular letter while S was in high demand) needed a little refresher. I dedicated a big chunk of my Master’s degree last year to network analysis, and I loved it, but I hadn’t done anything in that space since. I mainly just wanted to check that I still had it – that I was still able to extract and build a network from messy social data without having a minor or major nervous breakdown. There’s no actual analysis in this post – just plenty of good old data wrangling and a nice little picture that visualises the network. Of course, the next step would be to analyse the network in depth, but this is more a summary of how one might go about building a network in the first place. All you need to know is that nodes are, in this case, people in the network; edges are links between people; and edge weight measures how strong a link is, based on how often it occurs.

The data: Hello Twitter, my old friend

After covering Reddit and YouTube, I’ve wanted to teach myself how to build a network from Twitter data for a little while now. There are lots of different ways to do this once you’ve got the data. You can look at hashtag usage, for example, or you could build a network of user mentions. Or, as in my case, you can look for a pre-determined set of keywords to find out which keywords tend to be mentioned in the same tweet.

To build my data set, I used the Twitter Streaming API and collected data during an episode of Love Island – the British reality dating show that’s capturing the nation’s attention from 9pm to 10pm, every single night – tracking the official hashtag #LoveIsland. I wanted to build a network of contestants to see which couples tend to be mentioned together the most, and which Islanders – my nodes in the network – people don’t really engage with online, which turned out to be a pretty good predictor of who would and who wouldn’t make it through to the next episode.

Getting messy

Extracting network ties between contestants from tweets turned out to be so much messier than I could have imagined. I’ve uploaded the code for this week if you want to browse it – no judgement, please.

In theory, this kind of network is what you call a bipartite network – there’s no direct link between Olivia and Chris, for example, or Montana and Alex. There are tweets, there are contestants, and there are indirect links between contestants because their names pop up in the same tweets, so this type of network infers relationships from co-occurrence. Usually, you analyse bipartite networks by projecting them into a one-mode network (a network that only has one type of node) using a library like NetworkX in Python. In this case, however, I decided to do this projection myself, and extract the ties between couples using all the data wrangling tricks that my favourite library Pandas has up its sleeve.
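For comparison, here is a rough sketch of what the library route could look like. It is not the approach taken in this post. The tweet IDs and the contestants named in each tweet are made-up toy data, and the only NetworkX-specific call is bipartite.weighted_projected_graph, which sets each edge weight to the number of tweets two contestants share.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy data: tweet id -> contestants named in that tweet.
tweet_mentions = {
    "t1": ["Olivia", "Chris"],
    "t2": ["Olivia", "Chris"],
    "t3": ["Montana", "Alex"],
    "t4": ["Olivia", "Chris", "Montana"],
}

# Build the two-mode (bipartite) graph: tweets on one side, contestants on the other.
B = nx.Graph()
B.add_nodes_from(tweet_mentions, bipartite=0)
contestants = {name for names in tweet_mentions.values() for name in names}
B.add_nodes_from(contestants, bipartite=1)
for tweet_id, names in tweet_mentions.items():
    B.add_edges_from((tweet_id, name) for name in names)

# Project onto the contestant nodes; weight = number of shared tweets.
projected = bipartite.weighted_projected_graph(B, contestants)

for source, target, data in projected.edges(data=True):
    print(source, target, data["weight"])
```

Note that this projection counts every co-mention, including tweets that name three or more contestants, which is different from the stricter "exactly two names" rule used in the rest of this post.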

First, I defined a function to extract keywords – the names of the 22 contestants – from the tweet text and fill a column named after each contestant with their name if it was present in that specific tweet. I then dropped every tweet that mentioned fewer than two contestants, since I was interested in relationships and not individual criticism.
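A minimal sketch of that step might look like the following. The function name add_contestant_columns, the "text" column, the four-name list and the example tweets are all assumptions for illustration; the real data had 22 names and the original function may well have worked differently.

```python
import re
import pandas as pd

# Hypothetical stand-ins: the real data had 22 contestants and far more tweets.
CONTESTANTS = ["Olivia", "Chris", "Montana", "Alex"]

tweets = pd.DataFrame({"text": [
    "Olivia and Chris are at it again #LoveIsland",
    "Montana deserves better #LoveIsland",
    "olivia, chris and Alex in one scene #LoveIsland",
]})

def add_contestant_columns(df, names, text_col="text"):
    """Add one column per contestant holding their name if the tweet mentions them."""
    out = df.copy()
    for name in names:
        # Whole-word, case-insensitive match on the contestant's name.
        pattern = rf"\b{re.escape(name)}\b"
        mentioned = out[text_col].str.contains(pattern, case=False, regex=True)
        # Name where mentioned, missing value everywhere else.
        out[name] = pd.Series(name, index=out.index).where(mentioned)
    return out

tagged = add_contestant_columns(tweets, CONTESTANTS)

# Drop tweets that mention fewer than two contestants.
tagged = tagged[tagged[CONTESTANTS].notna().sum(axis=1) >= 2]
print(tagged[CONTESTANTS])
```

Plain name matching like this will miss nicknames and misspellings, so treat it as a starting point rather than a complete keyword extractor.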

Next, I counted the number of missing values in each row and added this number to a new column. I did this so I could drop every row that didn’t have exactly 20 missing values. With 22 contestant columns, 20 missing values meant that precisely two people were mentioned in that tweet – no more, no less; I was looking for people talking about pairs, not listing contestants.
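As a sketch, the missing-value filter could be written like this. The column name n_missing and the four-contestant toy table are assumptions; with the real 22 columns, the target count works out to 20.

```python
import pandas as pd

# Hypothetical stand-in for the 22 contestant columns.
CONTESTANTS = ["Olivia", "Chris", "Montana", "Alex"]

tagged = pd.DataFrame({
    "Olivia":  ["Olivia", None,      "Olivia"],
    "Chris":   ["Chris",  None,      "Chris"],
    "Montana": [None,     "Montana", None],
    "Alex":    [None,     "Alex",    "Alex"],
})

# Count the missing values across the contestant columns for each tweet.
tagged["n_missing"] = tagged[CONTESTANTS].isna().sum(axis=1)

# Exactly two names present means (number of contestants - 2) missing values.
expected_missing = len(CONTESTANTS) - 2
pairs_only = tagged[tagged["n_missing"] == expected_missing]
print(pairs_only)
```

Deriving the threshold from len(CONTESTANTS) rather than hard-coding 20 keeps the filter correct if the cast list changes.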

After this, I fiddled with the data until I was able to create a new dataframe with just two columns for each tweet – the two names mentioned. I then counted (with the help of group-by) how often each couple was mentioned to determine the weight of their edge. Somewhere in this process, I also manually added in a column called ‘Type’ so I could load the network into Gephi as an undirected network. Are you still following? I warned you it was going to get a little messy, but hey, it worked.
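One way the fiddling could go, sketched on toy data. The Source, Target, Type and Weight column names follow Gephi's edge-table convention; the input table and the sorting trick are assumptions, not necessarily what the original code did.

```python
import pandas as pd

# Hypothetical stand-in for the 22 contestant columns.
CONTESTANTS = ["Olivia", "Chris", "Montana", "Alex"]

# Toy input: every row already mentions exactly two contestants.
pairs_only = pd.DataFrame({
    "Olivia":  ["Olivia", "Olivia", None],
    "Chris":   ["Chris",  "Chris",  None],
    "Montana": [None,     None,     "Montana"],
    "Alex":    [None,     None,     "Alex"],
})

# Collapse each row to its two names. Sorting them alphabetically means
# (Chris, Olivia) and (Olivia, Chris) can never end up as separate edges.
pairs = pd.DataFrame(
    [sorted(row.dropna()) for _, row in pairs_only[CONTESTANTS].iterrows()],
    columns=["Source", "Target"],
)

# Count how often each pair occurs; that count is the edge weight.
edges = pairs.groupby(["Source", "Target"]).size().reset_index(name="Weight")

# Mark every edge as undirected for Gephi.
edges["Type"] = "Undirected"
print(edges)
```

Calling edges.to_csv("edges.csv", index=False) would then write a file that can be imported into Gephi as an edge table (the filename is just an example).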

Finally, I read my edge table into Gephi to visualise the Love Island network:


You can spot strong couples, love triangles, budding romances and utterly irrelevant contestants. Unsurprisingly, most of the Islanders on the periphery did not make it to the end of that episode. Things in the Love Island villa have changed since then, as loyal fans of the show will know, but the live Twitter conversation painted a pretty accurate picture of what was going on at that time.

If I recreated this in future, I would add one extra step (who am I kidding – probably more like 20 extra steps) in order to define links as negative or positive, according to tweet sentiment, to inject some nuance into the model.

Any questions or comments, please let me know – I’d love to hear what you think!
