Beware of Click-Baity Titles like ^
Another Project at Metis, and another detective that I don't know. Well, meet Detective McNulty of The Wire. He represents our 3rd major project here at Metis which involves machine learning and D3.js.
In this project, I am analyzing county level data from County Health Rankings in the US and election data from the last presidential elections.
Don't Trust Anyone, Especially The Guardian!
Cleaning data is always an interesting endeavor that has its fair set of challenges. If there's one thing I learned here, it is that you can't necessarily trust "pre-cleaned" data sources. I found the data on The Guardian on this page. I downloaded the excel file and found that there were 2 files. One was "FULL DATA" and one was "OBAMA V ROMNEY ONLY". Being the great data scientist I was, I picked the latter one since I only needed to know the data on Obama and Romney for my project. The data basically gives this info:
It gives the number of votes for Obama and Romney for each county in the US. This is perfect because my other data source is divided by county as well. But I started noticing some weird things about this data source...
I noticed that there were over 4,0000 rows, which is nice, except for the fact that there are only about 3,000 counties in the US. The number next to the county is called the FIPS number and each county has a unique one. But for about 4 or 5 states, there were many repeats, and this is why there was such a large number of counties. I found out that this was the case because for those states, they also included towns.
So I sent the file over to my cloud server and set up a PostgreSQL database. I copied the data in and ran some queries so I could sum up the values for each county where they included town-level data:
SELECT fips, sum(dem) as dem, sum(rep) as rep INTO election_data FROM election_2012 GROUP BY fips ORDER BY fips ASC;
Weirdly enough, now I only had 2,700 rows! It's better than the 4,000, but now I was missing several hundred counties! So I joined this data with my other data that had a known list and analyzed it a bit more.
SELECT * FROM county_data LEFT JOIN election_data ON county_data.fips = election_data.fips;
I found out that there were a few states that were missing! How did The Guardian just end up missing a few states... so I went back to the original data files and found this:
The first picture shows the "FULL DATA" while the bottom shows the "OBAMA V ROMNEY ONLY" data. Basically, in the full data set, it lists all the candidates and vote totals for each candidate for each county. BUT it's not always in order. As in, it doesn't always list the candidate with the most votes first and the second most votes second. Obama and Romney are the first 2 candidates / votes listed in almost all the data, except for 5 states. And those were the states that had the missing data.
Mystery solved! So I recleaned the data knowing that. Now how to deal with the NaN values in the county data dataset... coming soon.
You can find the follow up entry here.
What I Learned Today:
A set datatype in python is like a list that doesn't allow repeats.