Political Sense From The Census

Old Census

Tricky Presidential Elections

Presidential elections, contrary to popular belief, are not decided by the national popular vote. In this sense, the United States is not a traditional direct "democracy" (like in old school Greece), as many of us were mistakenly, and often purposely but not deceptively, taught as kids. The Electoral College is made up of "electors" from each state. Each state has as many electors as it has representatives in the House of Reps, plus 2 for its senators. Within each state, whichever candidate gets the most votes gets all of the state's electoral votes (except in Nebraska and Maine, which award two electoral votes to the statewide winner and one to the winner of each congressional district). So if a candidate won California with 51% of the popular vote, he or she would still get all 55 of California's massive electoral vote count, while a candidate with 49% would get 0. This naturally leaves room for a presidential candidate to win without getting the majority of votes. This happened as recently as the 2000 election between Al Gore and George W. Bush: Gore won the popular vote by about half a million votes but lost the election.
Popular Vote
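To see how winner-take-all can split the two outcomes, here is a minimal sketch with three hypothetical states. Every name and number in it is made up for illustration, and ties are ignored.

```python
# Hypothetical three-state election; every number here is made up.
# Each entry: (electoral votes, votes for candidate A, votes for candidate B)
states = {
    "Big State": (10, 5_100_000, 4_900_000),
    "Mid State": (6, 1_000_000, 3_000_000),
    "Small State": (5, 1_300_000, 1_200_000),
}


def tally(states):
    """Return (electoral, popular) totals per candidate under winner-take-all."""
    electoral = {"A": 0, "B": 0}
    popular = {"A": 0, "B": 0}
    for electors, votes_a, votes_b in states.values():
        popular["A"] += votes_a
        popular["B"] += votes_b
        # The statewide winner takes every elector (ties are not handled here).
        winner = "A" if votes_a > votes_b else "B"
        electoral[winner] += electors
    return electoral, popular


electoral, popular = tally(states)
print(electoral)  # {'A': 15, 'B': 6}
print(popular)    # {'A': 7400000, 'B': 9100000}
```

Candidate A wins two states by slim margins and collects 15 of the 21 electors, even though candidate B has 1.7 million more votes overall.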

TL;DR

If you just want to play around with the visualizations, check them out here: The Formidable Census Data Explorer.

Data Sources

For this project, I wanted to figure out what variables best predict whether a county will go Republican or Democrat. To do this I "borrowed" census data, mostly from the County Health Rankings & Roadmaps program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. From there I parsed through data that included nearly 100 categories for each of the 3000+ counties in the US. I shrank the data down to about 65 categories and added a few categories of my own.

Cleaning the Data

Naturally, there was some data missing, but luckily the great majority of the data was complete. Given the timeframe I had for this project, I didn't want to spend an inordinate amount of time cleaning the data, but I also didn't want to compromise the accuracy of my results. So I did a pretty simple clean. Gross inaccuracies, meaning values that were several standard deviations out, I fixed by hand using other sources I could find. For everything else, I filled in the missing data with the average for that particular state.
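The state-average fill can be sketched roughly like this. It is a standard-library illustration, not the code used for the project; the row layout, the `uninsured_pct` column, and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_state_mean(rows, column):
    """Replace None in `column` with the mean of that column within the same state.

    `rows` is a list of dicts with at least a "state" key (hypothetical layout).
    Returns new dicts; the input rows are left untouched.
    """
    known = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            known[row["state"]].append(row[column])
    state_means = {state: mean(values) for state, values in known.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[column] is None:
            # Stays None if the state has no known values at all.
            new_row[column] = state_means.get(new_row["state"])
        filled.append(new_row)
    return filled


# Made-up sample data.
counties = [
    {"county": "A", "state": "TX", "uninsured_pct": 20.0},
    {"county": "B", "state": "TX", "uninsured_pct": None},
    {"county": "C", "state": "TX", "uninsured_pct": 30.0},
    {"county": "D", "state": "VT", "uninsured_pct": 8.0},
]
cleaned = fill_with_state_mean(counties, "uninsured_pct")
print(cleaned[1]["uninsured_pct"])  # 25.0
```

If the data lives in a pandas DataFrame, the same idea is usually written as `df.groupby("state")[col].transform(lambda s: s.fillna(s.mean()))`.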

All I See Is Red

All Red

One of the first things I noticed is that most of the country is RED! And I don't mean just a simple majority. 2,454 out of the 3,141 counties and county-equivalents I included in my data were Republican leaning! That's 78.1 percent! How is this possible? Well, it's actually really simple. In the graph below, taken from my D3 visual of this project (click here to see it), the darkest shades represent high population areas. We can see that almost all the high population areas voted Democrat. There are almost no dark red counties.

Density

Machine Learning

In figuring out a model to predict whether a county would lean Republican or Democrat, I tried out several different classifiers. All in all, I tried: Logistic Regression, Support Vector Machines, Gaussian Naive Bayes, K-Nearest Neighbors, Decision Trees, Random Forests, and Extremely Randomized Trees. For the outcomes, I created a new column with 1s and 0s, with 1 being Democrat and 0 being Republican. I did it this way very intentionally. With nearly 80% of counties being Republican, a model that predicted all 0s would already be about 80% accurate. By making Democrat the positive class, precision and recall measure how well the models find the minority class, which makes things a little bit harder for my models. So here are my results:
ROC Curves
The y-axis represents the true positive rate while the x-axis represents the false positive rate. You can see that for the most part, these models performed decently, with SVM and Logistic Regression performing especially well.
Other Metrics
This shows that the models are all, for the most part, in the same ballpark. While accuracy and precision were decent, recall and F1 suffered. The recall score means that the models only identified about 60% of all the actual Democrat-leaning counties. F1 is the harmonic mean of precision and recall, and the F1 scores were all in the 65-70% range.
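To make these metrics concrete, here is a small from-scratch sketch with 1 = Democrat as the positive class. The labels are made up, and the function name is just for illustration. It shows why accuracy alone is misleading when about 80% of counties fall in one class.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (1 = Democrat)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Democrat, predicted Democrat
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Republican, predicted Democrat
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Democrat, predicted Republican
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # Republican, predicted Republican
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 8 Republican counties (0) and 2 Democratic counties (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

always_republican = [0] * len(y_true)
print(binary_metrics(y_true, always_republican))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

finds_one_democrat = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(binary_metrics(y_true, finds_one_democrat))
```

The do-nothing model scores 80% accuracy but zero recall. The second model finds only half of the Democratic counties, and its recall of 0.5 says so even though its accuracy is 90%.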

Further Analysis

I spent a lot of time thinking about the results and how to improve them, and realized that I wasn't using all the data that I had at my disposal. Many counties were very closely divided when it came to political parties. If a county was 51% Democrat-leaning, it would still be a 1. Counties like this would definitely give my models trouble and decrease the accuracy and precision. Luckily, I also had the actual voting data from the elections. So I created a percent-based outcomes column. Instead of binary predictions, I retrained my models to predict percentages.
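The difference between the two outcome columns can be sketched as follows. This assumes the percentage is the Democratic share of the two-party vote, which is a guess at the definition; the function name and vote counts are hypothetical.

```python
def make_targets(dem_votes, rep_votes):
    """Return (binary_label, dem_share) for one county from two-party vote counts.

    Assumption: share is computed over Democratic + Republican votes only.
    """
    dem_share = dem_votes / (dem_votes + rep_votes)
    label = 1 if dem_share > 0.5 else 0
    return label, dem_share


narrow = make_targets(5_100, 4_900)    # barely Democratic county
landslide = make_targets(9_000, 1_000)  # overwhelmingly Democratic county
print(narrow, landslide)
```

Both counties get the same binary label of 1, but their shares are very different. A regressor trained on the share gets to see that difference; a classifier trained on the label does not.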
Regressors

Because regressors predict a continuous value rather than a black-and-white class, you don't get scores for accuracy or precision. Instead you get the Mean Squared Error (MSE), which is a measurement more similar to variance (its square root is comparable to a standard deviation). The Extra Trees Regressor and Random Forest seemed to minimize the MSE pretty well. A very cursory analysis would show that these models were off by an average of about 7 percentage points each time, which is not great, but also not bad given the nature of the data.
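Here is a rough sketch of how these error measures relate, using made-up vote shares in percent. MSE is in squared units, so taking the square root (RMSE) or using the mean absolute error (MAE) gives a number you can read directly in percentage points. Which of these sits behind the "about 7" figure is not stated, so treat this as illustration only.

```python
from math import sqrt


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for paired actual and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae}


# Made-up Democratic vote shares, in percent.
actual = [30.0, 45.0, 60.0, 25.0]
predicted = [35.0, 40.0, 50.0, 25.0]

scores = regression_errors(actual, predicted)
print(scores)
```

Here the MSE is 37.5 (in squared percentage points), the RMSE is roughly 6.1 points, and the MAE is exactly 5 points. The RMSE is larger than the MAE because squaring penalizes the one 10-point miss more heavily.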

Top Predictors

One of the great things about tree-based models is that you can easily figure out which features are most "important", and scikit-learn makes it even easier. It just requires a few lines of code.

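from sklearn.ensemble import ExtraTreesClassifier  
trees = ExtraTreesClassifier(n_estimators=150, bootstrap=True)  # 150 extremely randomized trees  
trees.fit(X, y)  # X: county features, y: 1 = Democrat, 0 = Republican  
importances = trees.feature_importances_  # one score per column of X, summing to 1  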

Before running this code, I tried to predict what the top predictors would be and I made a potential top 5 list:

My Top 5 List:

  1. Population Density
  2. Average Income
  3. African American Percent
  4. Over-65 Percent
  5. HS Graduation Percent

The actual list is this:

The Real Top 15 list:

  1. Non-Hispanic White Percent
  2. African American Percent
  3. Rural Percent
  4. Physically-Inactive Percent
  5. Single-Parent Household Percent
  6. Uninsured Percent
  7. Obese Percent
  8. Asian Percent
  9. STD Rate
  10. Homicide Rate
  11. Population
  12. Population Density
  13. Under-18 Percent
  14. Healthcare Costs Average
  15. Hispanic Percent

There is no causation in this data, just predictive power, so don't read too deeply into this.
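Turning an array of importances into a ranked list like the one above takes one more small step: pair each score with its column name and sort. A minimal sketch follows; the names and scores are made up, not the real results, and `rank_features` is a hypothetical helper.

```python
def rank_features(names, importances, top=15):
    """Pair each feature name with its importance and return the `top` largest."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]


# Hypothetical column names and scores, not the real results.
names = ["Rural Percent", "Population", "Obese Percent", "Asian Percent"]
scores = [0.40, 0.10, 0.30, 0.20]

top3 = rank_features(names, scores, top=3)
for position, (name, score) in enumerate(top3, start=1):
    print(f"{position}. {name} ({score:.2f})")
```

With a fitted scikit-learn model, `importances` would be `trees.feature_importances_` and `names` would be the feature column names, in the same order as the columns of `X`.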

Conclusions

It was definitely not easy to predict how counties will vote with complete certainty. But there is still a lot of potential there, and if you want to predict within 10 percentage points, you can do so with surprising accuracy.
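One way to put a number on "within 10 percentage points" is to count how often a prediction lands inside that tolerance. This is a sketch on made-up values; `share_within` is a hypothetical helper, not something from the project.

```python
def share_within(y_true, y_pred, tolerance=10.0):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tolerance)
    return hits / len(y_true)


# Made-up Democratic vote shares, in percent.
actual = [30.0, 45.0, 60.0, 25.0]
predicted = [35.0, 40.0, 48.0, 25.0]

hit_rate = share_within(actual, predicted)
print(hit_rate)  # 0.75
```

Three of the four predictions are within 10 points; the 12-point miss on the third county is the one that falls outside.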

The Viz

For my visualization of this data, I was inspired by the lack of good interactive visuals out there for census data. Most of them had very bulky interfaces, with information buried layers deep and page reloads at every step. So I wanted to create an interface that makes trends easy to see and is incredibly easy to use, with no forced page loads, which meant plenty of AJAX.

Check it out here.
Census Viewer

Further Thoughts

The next thing I would like to do with my visualization is connect it to the regression model that I developed. So with a click of a button, it would predict a new possible election map and determine which party would win the election!


What I Learned Today:
Only two states, Nebraska and Maine, can split their electoral votes: each awards two to the statewide winner and one to the winner of each congressional district. The other 48 states and the District of Columbia award all of their electoral votes to the candidate who wins the popular vote in the state, regardless of the margin of victory.