I wanted to run geo-data calculations on Amazon Elastic MapReduce using Hadoop streaming jobs – particularly in Python. The catch is that you cannot easily install extra Python dependencies on the EMR nodes, but this problem can be solved with the cacheArchive feature of Hadoop.
To do this, first download the standalone python-geoip module from http://code.google.com/p/python-geoip/
Now you have both files: the standalone pygeoip.py module and the GeoIP.dat database.
Archive them into one file:
$ tar czvf geoip.tgz *
Upload the archive to Amazon S3 with whatever tool you prefer:
$ s3cmd put geoip.tgz s3://bucket/
When you run your EMR job, make sure to specify this in the “extra arguments” option/section:
-cacheArchive s3://bucket/geoip.tgz#geoip
This tells EMR to untar the TGZ file on each node and make its contents available to the streaming job in a directory called “geoip”.
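The `#` in that argument separates the archive’s S3 location from the local link name the job sees. A tiny helper (my own illustration, not part of Hadoop) makes the convention explicit:

```python
def parse_cache_archive(arg):
    """Split a -cacheArchive value like 's3://bucket/geoip.tgz#geoip'
    into (archive_uri, local_link_name).

    Hadoop extracts the archive and links the extracted contents
    under local_link_name in the task's working directory.
    """
    uri, sep, link = arg.partition('#')
    if not sep:
        raise ValueError("expected 'uri#linkname', got: %r" % (arg,))
    return uri, link
```

For example, `parse_cache_archive('s3://bucket/geoip.tgz#geoip')` returns `('s3://bucket/geoip.tgz', 'geoip')` – which is why the mapper below can read from a `./geoip` directory.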
Make sure your mapper/reducer adds the python-geoip directory to the system path *before* importing the module:
import sys
sys.path.append('./geoip')
import pygeoip
Open the database like this:
GEOIP = pygeoip.Database('geoip/GeoIP.dat')
Finally, your mapper job can use code similar to that found on the python-geoip site. Here’s a complete mapper example (working on an input file that contains only IP addresses, one per line):
import sys
sys.path.append('./geoip')
import pygeoip

# Load the database once and store it globally in interpreter memory.
GEOIP = pygeoip.Database('geoip/GeoIP.dat')

for line in sys.stdin:
    ip = line.strip()
    info = GEOIP.lookup(ip)
    if info.country:
        print info.country + "\t" + "1"
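The mapper emits country<TAB>1 pairs, so a matching reducer only has to sum the counts per country. Hadoop streaming sorts mapper output by key before the reducer sees it, so a running total per key is enough. Here is a sketch of such a reducer (my own code, not from the python-geoip site; it uses single-argument print() so it runs on either Python 2 or 3):

```python
import sys

def reduce_counts(lines):
    """Sum the "1" counts emitted by the mapper, per country code.

    Assumes the input lines arrive grouped/sorted by key, which
    Hadoop streaming guarantees between the map and reduce phases.
    """
    totals = []
    current_key = None
    current_total = 0
    for line in lines:
        key, _, value = line.strip().partition("\t")
        if key != current_key:
            # Key changed: flush the finished group before starting anew.
            if current_key is not None:
                totals.append((current_key, current_total))
            current_key = key
            current_total = 0
        current_total += int(value)
    if current_key is not None:
        totals.append((current_key, current_total))
    return totals

if __name__ == "__main__":
    for country, total in reduce_counts(sys.stdin):
        print(country + "\t" + str(total))
```

You can sanity-check the pair locally before submitting the job, e.g. cat ips.txt | python mapper.py | sort | python reducer.py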