GeoIP on Amazon Elastic Map Reduce (EMR) using Hadoop Streaming (Python)

23 Apr

I wanted to be able to run geo-data calculations on Amazon Elastic MapReduce using Hadoop streaming jobs, particularly in Python. While you cannot easily install the required Python dependencies on the streaming nodes, the problem can be solved with Hadoop's cacheArchive feature.

In order to do this, first download the python-geoip standalone file (pygeoip.py), and also a MaxMind database. Either the free country database or the free city database should work.

Now you have both files:

$ ls
GeoIP.dat  pygeoip.py

Archive them into one file:
$ tar czvf geoip.tgz *

Upload to Amazon S3 using whatever means you need to:
$ s3cmd put geoip.tgz s3://bucket/

When you run your EMR job, make sure to specify this in the “extra arguments” option/section:
-cacheArchive s3://bucket/geoip.tgz#geoip

This tells EMR to untar the TGZ file and put it in a directory called “geoip” for the streaming job.
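If you want to see exactly what layout the streaming job ends up with, here is a minimal local simulation (the empty stand-in files are placeholders, not the real downloads): it packs two files into a .tgz and extracts them into a directory named after the `#geoip` fragment, mirroring what EMR does on each task node.

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so nothing here touches real files.
workdir = tempfile.mkdtemp()
os.chdir(workdir)

# Stand-ins for the two downloaded files, pygeoip.py and GeoIP.dat.
for name in ('pygeoip.py', 'GeoIP.dat'):
    open(name, 'wb').close()

# Equivalent of: tar czvf geoip.tgz *
with tarfile.open('geoip.tgz', 'w:gz') as tgz:
    tgz.add('pygeoip.py')
    tgz.add('GeoIP.dat')

# Equivalent of what -cacheArchive ...geoip.tgz#geoip does on a task
# node: untar the archive into a directory named after the fragment.
with tarfile.open('geoip.tgz', 'r:gz') as tgz:
    tgz.extractall('geoip')

print(sorted(os.listdir('geoip')))
```

The mapper below relies on exactly this layout: `geoip/pygeoip.py` for the import and `geoip/GeoIP.dat` for the database.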

Make sure your mapper/reducer adds that "geoip" directory to the Python module search path *before* you import pygeoip:

import sys
sys.path.append('geoip')
import pygeoip

Then open the database like this:

GEOIP = pygeoip.Database('geoip/GeoIP.dat')

Finally, your mapper job can use code similar to that found on the python-geoip site. Here’s a complete map example (working on an example file that only has IP addresses):

import sys

# The cacheArchive contents are extracted under "geoip/", so add that
# directory to the module search path before importing pygeoip.
sys.path.append('geoip')
import pygeoip

# Load the database once and store it globally in interpreter memory.
GEOIP = pygeoip.Database('geoip/GeoIP.dat')

for line in sys.stdin:
    ip = line.strip()
    info = GEOIP.lookup(ip)
    print info.country + "\t" + "1"
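The post stops at the mapper, but the `country \t 1` pairs it emits still need to be summed. Here is a matching reducer sketch (not from the original post); it relies on Hadoop streaming sorting the mapper output by key before the reduce phase, so all lines for one country arrive consecutively:

```python
import sys

def reduce_counts(lines):
    # Sum "country\t1" pairs. Assumes the input is sorted by key,
    # which Hadoop streaming guarantees between map and reduce.
    current, count = None, 0
    for line in lines:
        country, _, value = line.strip().partition('\t')
        if country != current:
            if current is not None:
                yield '%s\t%d' % (current, count)
            current, count = country, 0
        count += int(value)
    if current is not None:
        yield '%s\t%d' % (current, count)

if __name__ == '__main__':
    for out in reduce_counts(sys.stdin):
        print(out)
```

Pass this script as the reducer for the streaming job, with the mapper above; locally you can test the pair with `cat ips.txt | python mapper.py | sort | python reducer.py`.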
