Project Fletcher: Twitter Streaming Counties

Counties Given a longitude and latitude point, how do you determine what county it's in. This is the problem I have finally solved. First of all, it seems really simple, and there are definitely simple ways to do it, but given the large volume of data I needed to process, I need a super fast and efficient way to do it.

So there were several things I needed to solve, let me break it down:
1. Deciding whether a point is in a county
2. Wait, what are the boundaries of a county?
3. Now, how do we handle counties that have multiple shapes(ie. they encompass several islands)?
4. Hmm, I don't want to iterate through 3000 counties to see if a point is in it...

  1. How do you know if a point is within a polygon?
    If the polygon is a rectangle with side that are parallel to longitude and latitude, this would be very easy. But looking at the image above, each county is a multi-sided polygon (each one with an average of 100-200 sides and some encompassing multiple polygons, like counties in Hawaii and Alaska).

Solution: I looked through several sources and found this. The winning answer is one where you create triangles from each side to the point. If the area of the triangles is greater than the area of the polygon, then it means the point is outside. Whereas if it's equal, then it's inside. Great solution, but it seems pretty taxing. Luckily there is another solution that uses dot products. If M of coordinates (x,y) is inside the rectangle, then:

(0 < AM⋅AB < AB⋅AB) ∧ (0 < AM⋅AD < AD⋅AD)

In python, that translates to something like this:

def in_polygon(point, poly):  
    """Return True if point is in polygon path"""
    x = point[0]
    y = point[1]
    n = len(poly)
    status = False
    p1x,p1y = poly[0]
    for i in range(n+1):
        p2x,p2y = poly[i % n]
        if min(p1y,p2y) < y <= max(p1y,p2y):
            if x <= max(p1x,p2x):
                if p1y != p2y:
                    xinters = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
                if p1x == p2x or x <= xinters:
                    status = not status
        p1x,p1y = p2x,p2y

    return status
  1. What are the boundaries of a county?
    This was MUCH harder to find than I thought it would be. I eventually stumbled upon this site that had converted the county boundaries points into JSON. I then converted the JSON into a giant Python dictionary through a somewhat painstaking process. I also had to iterate through it and swap the order of all the points because Twitter reverses the order for some reason...

  2. How to handle nested counties?
    I noticed that there were some counties that had nested arrays in them. I quickly realized this was for counties that had were islands or encompassed 2 or more areas that didn't connect. So I built a Python function that would determine if the list was nested or not. So that in the next function, I can decide whether I have to loop through it again.

def find_nested(nested_list):  
    """Return 1 for nested geo coordinates and 0 if not nested"""
        nest = 1
        nest = 0
    return nest
  1. Now let's get down to business... Business Obviously I can't iterate through every county every time to determine if the point is in it, so I needed to find another way. I decided to go the nearest neighbor approach and used a KD-tree method in SK Learn. It's pretty easy to implement and the search has a big-O of log(n):
from sklearn.externals import joblib  
from sklearn.neighbors import KDTree  
import numpy as np

tree = KDTree(county_geo, leaf_size=5)  
joblib.dump(tree, 'county_tree.pkl')  

I used joblib instead of pickle because it's more efficient for trees. The tree also returns the 5 nearest neighbors.

Now before we actually write the function we have to decide on some things. If the point is not in any of the nearest neighbors, I'm going to assume that the point might be on a body of water or something that is right off the coast of the county or something and still count it as the nearest neighbor. But if the distance to the nearest neighbor is too big (greater than 6 long/lat units), I will just assume it's outside of the US and return None.

And now because of the form of my county geo-boundaries list, this is how I determine which county it was in:

def find_county(point, full=0):  
    """Given a long/lat point in the United States return a
    corresponding FIPS ID"""

    # load, run KD Tree, and return 5 nearest points
    county_tree = joblib.load('geo_tree/county_tree.pkl')
    dist, indices = county_tree.query(point, k=5)
    fips_list = []
    for index in indices[0]:

        # print county_geo[index][0],county_geo[index][1]
                fips = county_geo_dictionary[(county_geo[index][0],county_geo[index][1])]

    # set default values for return
    fips = fips_list[0]

    # Find which path it exists in
    if len(fips_list) > 0:
        for fips_num in fips_list:
            county_paths = all_county_boundaries[str(fips_num)]
            if find_nested(county_paths) == 0:
                if in_polygon(point, county_paths):
                    fips = fips_num
                for path in county_paths:
                    if in_polygon(point, path):
                        fips = fips_num

    # Point is most likely outside of US, return None
    if dist[0][0] > 6:
        return None

    # If point is near but not in county, return nearest county
    if full == 1:
            return [fips,county_fips[int(fips)][1],county_fips[int(fips)][0]]
            return None
        return fips

This is all on Github as a library here.

Here's a picture of a running demo I made with the twitter streaming API:
TW Streaming Locations

Using this python module I created, I was able to now create 3 new maps (tweet volume, tweet density, mood) based on the outcomes and added them to my Census d3 Project.

Tweet Volume

Tweet Density

National Mood

What I Learned Today:
The "sixth sick sheik's sixth sheep's sick" is said to be the toughest tongue twister in the English language