Getting Whirr running on EC2 with Cloudera’s script
I haven’t fully used Hadoop yet, but it looks like a pretty amazing tool for crunching large datasets. Combined with Amazon EC2, it should be possible to crunch large datasets quickly on ephemeral instances. But I had problems getting Hadoop up and running on EC2…
I followed Cloudera’s instructions for setting up CDH3 scripts on the Amazon EC2 instances I was testing.
Everything went great until I got to the Whirr installation (which seems to be the easiest way to start up a number of nodes at once and have them auto-magically configured).
Following these instructions gave me this error: “Non-Windows AMIs with a virtualization type of ‘hvm’ currently may only be used with Cluster Compute instance types.”
Luckily, a search for this error message led to these helpful links:
- http://mail-archives.apache.org/mod_mbox/incubator-whirr-user/201103.mbox/%3CD890DBF7-9434-47AD-BA1E-E492099473A6@daltonclark.com%3E
- http://mail-archives.apache.org/mod_mbox/whirr-user/201112.mbox/%3CCAHZL8y8OncJ3tkWYfgwTnPP_6ziW8D02Z0s8u5-XST+1rv79yw@mail.gmail.com%3E
After some trial and error, I added these lines to my hadoop.properties, which worked:
==========
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
==========
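Since AMIs are specific to a region, the region prefix in whirr.image-id should agree with whirr.location-id. Here is a minimal Python sketch of a sanity check for those three lines before launching. This is not part of Whirr; the function names parse_properties and check_properties are made up for illustration, and the checks are only my guess at what is worth catching early.

```python
# Sample text mirroring the three lines added to hadoop.properties.
SAMPLE = """
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            continue  # not a key=value line
        props[key.strip()] = value.strip()
    return props


def check_properties(props):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if "whirr.hardware-id" not in props:
        problems.append("whirr.hardware-id is missing")

    image_id = props.get("whirr.image-id", "")
    location = props.get("whirr.location-id", "")
    region, sep, ami = image_id.partition("/")
    if not sep or not ami.startswith("ami-"):
        problems.append("whirr.image-id should look like <region>/ami-xxxxxxxx")
    elif region != location:
        problems.append(
            "image region %r does not match whirr.location-id %r" % (region, location)
        )
    return problems


if __name__ == "__main__":
    for problem in check_properties(parse_properties(SAMPLE)):
        print("WARNING:", problem)
```

To check a real file, pass its contents instead of SAMPLE, e.g. parse_properties(open("hadoop.properties").read()).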
After that, everything worked great. One small thing: when the whirr script finishes setting up the instances, it will say that the web-based interfaces are live, but to actually see them you will have to edit the security groups in AWS to accept incoming traffic from any IP address on these ports:
50030 0.0.0.0/0
50070 0.0.0.0/0
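The security group change above can also be scripted instead of clicked through in the AWS console. This is a rough sketch using boto3 (a third-party library, not something Whirr or Cloudera's scripts install). The security group ID in the usage note is a placeholder you would have to look up yourself, the region default is just the one from my properties file, and the port labels are the usual Hadoop defaults (50030 for the JobTracker UI, 50070 for the NameNode UI).

```python
# Ports of the Hadoop web interfaces that need to be reachable.
WEB_UI_PORTS = {
    50030: "JobTracker web UI",
    50070: "NameNode web UI",
}


def build_ingress_rules(ports, cidr="0.0.0.0/0"):
    """Build one TCP ingress rule per port, open to the given CIDR block."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in sorted(ports)
    ]


def open_web_ui_ports(group_id, region="us-east-1"):
    """Add the ingress rules to an existing security group.

    group_id is whatever group the cluster instances ended up in;
    this sketch does not try to find it for you.
    """
    import boto3  # imported here so the rule builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=build_ingress_rules(WEB_UI_PORTS),
    )
```

Calling open_web_ui_ports("sg-00000000") with your real group ID in place of the placeholder would apply both rules. Passing a narrower cidr to build_ingress_rules (your own IP with /32, say) works the same way if you would rather not open the ports to everyone.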
