I haven’t fully used Hadoop yet, but it looks like a pretty amazing tool for crunching large datasets. Combine Hadoop with Amazon EC2, and it should be possible to process those datasets quickly on ephemeral EC2 instances. But I had problems getting Hadoop up and running on EC2…
I followed Cloudera’s instructions for setting up CDH3 scripts on the Amazon EC2 instances I was testing.
Everything went great until I got to the Whirr installation (Whirr seems to be the easiest way to start up a number of nodes at once and have them auto-magically configured).
Following these instructions gave me this error: “Non-Windows AMIs with a virtualization type of ‘hvm’ currently may only be used with Cluster Compute instance types.”
Luckily, a search for this error message led to these helpful links:
After some trial and error, I added these lines to my hadoop.properties, which fixed the problem:
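For reference, the usual fix for this HVM error is to pin Whirr to a paravirtual (non-HVM) AMI and a matching instance type via the properties file. This is only a sketch, not the exact lines I used — the image ID below is a placeholder, and the right values depend on your region:

```properties
# Hypothetical hadoop.properties additions (placeholder IDs; adjust for your region).
# Pinning a paravirtual AMI avoids the "hvm ... Cluster Compute" error,
# since HVM images only boot on Cluster Compute instance types.
whirr.location-id=us-east-1
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-xxxxxxxx
```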
Once there, everything worked great. One small gotcha: when the Whirr script finishes setting up the instances, it reports that the web-based interfaces are live, but you will have to edit the security groups in AWS to accept incoming traffic (from your IP address, or from anywhere) before you can actually see the web-based interfaces.
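One way to open those ports from the command line, assuming the AWS CLI and the standard Hadoop web UI ports (50070 for the NameNode, 50030 for the JobTracker); the security group name is a placeholder — use whichever group Whirr created for your cluster:

```shell
# Placeholder group name: substitute the security group Whirr actually created.
# 0.0.0.0/0 opens the port to any IP address; use your own IP's /32 CIDR
# instead if you want to keep the UIs private.
aws ec2 authorize-security-group-ingress \
    --group-name my-hadoop-cluster \
    --protocol tcp --port 50070 --cidr 0.0.0.0/0   # NameNode web UI

aws ec2 authorize-security-group-ingress \
    --group-name my-hadoop-cluster \
    --protocol tcp --port 50030 --cidr 0.0.0.0/0   # JobTracker web UI
```

The same change can of course be made by hand in the AWS console under Security Groups.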