Data

DIY Data Science, Part 1: P is for Pandas

April 23, 2017
PisforPandas

This last week, my programming brain has been all over the place. I’ve been looking into Natural Language Processing and trying to learn how to build a web application in Flask and it all felt like a huge complicated mess in my mind for most of the week. So I decided to turn to the familiar for comfort and kick off DIY Data Science with a topic that is near and dear to my heart – Pandas, Python’s superstar data analysis package.

Hello Pandas, my old friend

This is not the first time. Last December, after a few months of non-coding, I was desperate to get back into it. So my friend Zack offered to help me out and set me a little data cleaning and wrangling challenge to solve using Pandas.

It worked. After a few very frustrating and equally satisfying coding sessions, the spark was back. Thank you, Zack – you should follow him on Twitter to keep up with all the cool things he is working on.

When it was time to get started on this week’s project, I decided to go through this precise challenge again rather than do a completely new project, because it was tricky and fun and covered a lot of ground and if I’m going to do 24 weekly projects I want to ease myself into this. A quick note – I’m just focusing on transforming the dataframes; there is no analysis going on here.

The challenge

For this project, I worked with two dataframes. The first one was on 2016 US election result data, with each row listing county name, state, vote count and vote share for all four candidates. This is what it looked like:

electioncsv

The second file contained data on UFO sightings across the world. You can download the UFO data set on Kaggle – it’s a pretty fun dataset to play with.

ufocsv

Working my magic on those two data sets, I had to transform them into a third one – a table that for each state contained the share of both Trump and Clinton votes, the shape of UFOs spotted in that state, and how often that shape had been spotted in a particular colour. Easy enough, right?

resultcsv

The first step is always to load the module and import the files. I work in the Jupyter iPython notebook, which has been immensely helpful for me because it’s great to visualise what you’ve done immediately.

Import pandas as pd
election = pd.read_csv("")

1. Dropping all non-US observations in the UFO table

First, I got rid of all observations that were missing country specifications using the dropna() function. Then, I got rid of all rows that weren’t spotted in the US.

ufo = ufo.dropna(subset=['country'])
ufo = ufo[ufo.country.str.contains("us")]

2. This dataframe contains case-sensitive material

Python is case-sensitive, so I transformed all observations into lowercase first. Fun fact – this was my first ever use of a Python function.

def lowercase(column,df):
  df[column] = df[column].str.lower()
  lowercase("desc", ufo)

3. Dropping any UFO sightings that don’t specify colour

ufo7[ufo7['desc'].str.contains("black|blue|brown|cyan|grey|green|indigo|magenta|orange|pink|purple|
red|violet") ==True]

Now, this is not perfect. There might be more colours that I missed. If this had been a more serious project – not to belittle the political relevance of UFO sightings – I would have probably pulled a very exhaustive list of colours from the Internet and searched for every single one. But for the purpose of this practice, covering a few classics was enough.

4. Summing up the votes per state and determining vote share at state-level

The original election table shows votes at county-level. To get the votes at state-level, I used the Numpy module to first sum up the number of votes per candidate per state.

election_sum = election3.groupby(['state']).agg({'trump_vote': np.sum, 'clinton_vote': np.sum,
 'johnson_vote': np.sum, 'stein_vote':np.sum})

I then calculated the total votes per state and divided Clinton’s and Trump’s votes by that sum to get their vote share per state. We don’t care about Johnson and Stein. The election dataframe is ready to mingle!

5. Isolating the colour from UFO description

Next, I isolated these previously defined colours from the description column, creating new columns with the name of the colour and the value True or False for each UFO sighting. I then turned this into 0 for False and 1 for True.

def isolate_colour(df, search, column):
  df[search] = df.apply(lambda x: search in x[column], axis=1)

colours = ["black","blue","brown","cyan","grey","green","indigo", "magenta", "orange", "pink",
"purple", "red", "violet"]

for i in colours:
  isolate_colour (ufo7, i, "desc")

def change_int(df, column):
  df[column] = (df[column]).astype(int)

for i in colours:
  change_int (ufo7, i)

6. Merging dataframes

Time to join dataframes!

result = pd.merge(left=ufo7,right=election7, how='left', left_on='state', right_on='state')
result = result[["state", "trump_share", "clinton_share", "shape", "black","blue","brown","cyan",
"grey","green","indigo","magenta", "orange", "pink", "purple", "red", "violet"]]

I used a left-join to merge the two dataframes, which keeps all columns from the left dataframe as keys, using the state column as the common element, and supplementing the rows from the UFO dataframe with information from the election dataframe table. You can read all about joining, merging, and appending dataframes here.

almostthere

7. Group by state

But now I still have one row per UFO sighting. Time to aggregate some more! The group-by function is one of my favourite Pandas functions. It’s a bit like magic. All you need to do is to choose which columns you want to group by – I divide by state, and then split up the records by state according to their shape. I use the ‘sum’ function on the colour columns to count how often one colour pops up per state and shape. Clinton’s vote share and Trump’s vote share are just tagging along. Functions like group-by don’t change the original dataframe, so you need to define a new one to keep them for good. Group-by can do some pretty elaborate stuff, so it’s worth looking into for any kind of advanced analysis. This is a nice summary.

result2 = result.groupby(by=['state', 'shape', 'clinton_share', 'trump_share'])["black","blue",
"brown","cyan","grey","green","indigo","magenta", "orange", "pink", "purple", "red", "violet"].sum()

8. Cleaning up

The only trouble with group-by is that it can mess up the index a little bit, so I reset the index to make everything look nice and smooth.

result2 = result2.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

9. Export & you’re done

All that was left to do was to save the dataframe to CSV – and voila! Here we go.

result2.to_csv('result_ilinca.csv')

Now, while there was quite a bit of data wrangling involved in this project, I certainly didn’t use every single function that there is. There is plenty more where that came from. Sometimes all it takes is one line of standard code, sometimes you need to patch code together and trial and error. Either way, it’s usually worth typing into Google exactly what you need if you find yourself struggling. Someone has probably already asked your question on Stackoverflow so you don’t have to humiliate yourself in the presence of experts. Most of the time, however, the Pandas documentation is all you need.

To me, Pandas is essentially a neater version of what I imagine Excel to be at its best, but within the Python environment. Whatever you are able to do in Excel, you can probably also do in Pandas – including visualisation.

But that is a story for a different letter.

You Might Also Like

No Comments

Leave a Reply