Hadoop Tutorial 3.4 -- Uploading a Java version of WordCount to AWS

From DftWiki

This tutorial illustrates how to upload a compiled Java application to S3 and run it on EC2.

Page under construction!

The information presented here was gathered from several sources.

Software prerequisites

You will need these three packages for this tutorial:

  • Java 1.6, preferably from Sun
  • Hadoop 0.18.3
  • The Amazon Elastic MapReduce Ruby command-line client


  • Verify that your local computer runs Java 1.6. By "local computer" I mean any computer that you have access to, including your laptop, beowulf, xgridmac, or even one of the Hadoop cluster machines in Ford Hall:
javac -version
If it does not, download Java from Sun (http://java.sun.com/javase/downloads/index.jsp) and install it.

Hadoop 0.18

Amazon uses Hadoop 0.18. For this reason, any Hadoop/MapReduce application written in Java should be compiled against the Hadoop 0.18 core library.

  • Download hadoop-0.18.3
  • Unpack it into the folder of your choice, say hadoop-0.18.3

Amazon Elastic MapReduce Ruby Client

Download the command-line client for creating, describing, and terminating Job Flows on the Amazon Elastic MapReduce platform: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=266

  • Follow the directions on the Ruby client Web page. You may have to install Ruby if it is not already on your computer.
  • You need to set up a credentials file so that your Ruby client can issue commands to AWS in your name. Use emacs or your favorite editor to create a file called credentials.json in the directory where you unzipped the Ruby client:
 {
   "access_id":   "insert your aws access id here",
   "private_key": "insert your aws secret access key here",
   "keypair":     "insert the name of your amazon ec2 keypair here",
   "log_uri":     "insert the name of a bucket in s3 to place logs from your job"
 }
  • Use the AWS console to locate the various pieces of information requested. The log_uri string should be something of the form s3://yourbucketname/logs/.
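As a sanity check, you can create the file from the shell and verify that it parses as valid JSON before handing it to the client. This is just a sketch: it assumes Python is installed on your machine, and the placeholder values below are stand-ins for your real credentials.

```shell
# Write a placeholder credentials.json (substitute your own values;
# these are NOT real credentials) and check that it parses as JSON.
cat > credentials.json <<'EOF'
{
  "access_id":   "YOUR_ACCESS_ID",
  "private_key": "YOUR_SECRET_KEY",
  "keypair":     "YOUR_KEYPAIR_NAME",
  "log_uri":     "s3://yourbucketname/logs/"
}
EOF
# Pretty-prints the file if valid, exits non-zero if malformed.
python -m json.tool credentials.json
```

A stray comma or missing brace in credentials.json tends to surface later as a confusing client error, so validating it up front saves a round trip.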

The Java WordCount Program

Does not quite work... Some debugging required...
  • On your local machine, create a temporary directory where you are going to compile the java program. Let's call it temp and create it in the hadoop-0.18.3 directory.
 cd hadoop-0.18.3  
 mkdir temp
 cd temp
  • Copy the demo Java program that comes with Hadoop:
 cp ../src/examples/org/apache/hadoop/examples/WordCount.java .
  • Edit WordCount.java and change the package to something simpler:
 /* package org.apache.hadoop.examples; */
 package org.myorg;
  • Create a directory for the classes to be created:
 mkdir wordcount_classes
  • Compile the WordCount.java program
 javac -classpath ../hadoop-0.18.3-core.jar -d wordcount_classes WordCount.java
  • Create a Java archive (jar) for distribution:
 jar -cvf wordcount.jar -C wordcount_classes/ .           (don't forget the dot at the end!)
  • Upload the wordcount.jar program to your S3 folder.
  • Create a new job-flow and launch the job
 elastic-mapreduce --create --alive
 Created job flow j-2RWSYENM3ILIM

 elastic-mapreduce --jobflow j-2RWSYENM3ILIM  --jar s3://dft/prog/wordcount.jar --arg s3://dft/data/ --arg s3://dft/output/ 
  • Switch to the AWS console and refresh it. Observe that you have a new job running.
  • The step currently ends with errors:
  • controller
2010-03-19T02:57:45.842Z INFO Fetching jar file.
2010-03-19T02:57:51.167Z INFO Working dir /mnt/var/lib/hadoop/steps/1
2010-03-19T02:57:51.167Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java 
 -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:
  /home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/1 
 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop 
 -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/1/tmp 
  /mnt/var/lib/hadoop/steps/1/wordcount.jar s3://dft/data s3://dft/output
2010-03-19T02:57:52.248Z INFO Execution ended with ret val 255
2010-03-19T02:57:52.249Z WARN Step failed with bad retval
2010-03-19T02:57:54.295Z INFO Step created jobs:

  • stderr
 java.lang.ClassNotFoundException: s3://dft/data
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
	at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)