This is a follow-up to yesterday's entry about cleaning county election data. The other half (closer to 90%) of my data puzzle was county-level data for a myriad of other factors, found here (the first download). While this dataset provides a wide range of data about counties, it also has quite a few blank or NaN values. In some past projects and challenges, I would simply replace each missing value with the mean or mode of its column. For this one, I thought I could do a bit better.
I decided to replace each NaN with the column mean for the state that the county is in. This might not be the perfect approach, but since NaN values only make up a small percentage of the data, it shouldn't hurt my analysis. And given the short amount of time I have for this project, I didn't want to spend extra effort on it until I had a minimum viable product.
So here is the code I wrote in Python (with Pandas) to clean the NaNs:
import pandas as pd

county_data = pd.read_csv('county_data.csv')

for column in county_data.columns:
    for index, row in county_data.iterrows():
        if pd.isnull(county_data.loc[index, column]):
            # Mean of this column across all counties in the same state
            fill = county_data.loc[county_data['State'] == row['State'], column].mean()
            county_data.loc[index, column] = fill
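For what it's worth, the same state-mean fill can be done without the nested loops using pandas' groupby machinery. Here's a minimal sketch of that idea on a toy frame; the column names ('State', 'Income') are stand-ins, since the real dataset has many more columns:

```python
import pandas as pd
import numpy as np

# Toy stand-in for county_data; 'State' and 'Income' are assumed column names.
county_data = pd.DataFrame({
    'State': ['OH', 'OH', 'OH', 'PA', 'PA'],
    'Income': [50000, np.nan, 40000, 60000, np.nan],
})

# For every numeric column, fill NaNs with that column's mean within the state.
numeric_cols = county_data.select_dtypes(include='number').columns
county_data[numeric_cols] = county_data.groupby('State')[numeric_cols].transform(
    lambda col: col.fillna(col.mean())
)
```

This does one pass per group instead of checking every cell individually, which matters once the frame gets large.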
What I Learned Today:
In 87% of the record 74 consecutive Jeopardy! games that Ken Jennings won, he had more than double the points of his nearest opponent going into Final Jeopardy. Basically he's a boss.