Hadoop Tutorial 3 -- Hadoop on Amazon AWS

--D. Thiebaut 15:59, 18 April 2010 (UTC)

Deprecated! See this new tutorial instead!



This tutorial illustrates how to connect to Amazon AWS and run a Hadoop/Map-Reduce program on that service. Part 1 runs the wordcount program already covered in Hadoop Tutorial 1. Part 2 runs the same wordcount program, but this time we provide our own version. Part 3 presents a more sophisticated approach in which the Java version of wordcount is compiled locally, then uploaded to S3 and run from there.

This tutorial uses information found in several other tutorials, including

Login to Amazon AWS

The following section takes you through the steps necessary to log in to your Amazon Web Services (AWS) account.

Click on Sign In to AWS Management Console

AWS SignInToAWSManagementConsole.png

Sign in with your AWS account

AWS SignIn.png

You should then be signed in

AWS SignedIn.png

Sign-Up for EC2, S3, and MapReduce

  • Go to the Products section and sign up for all of these services, using your 352b.xx credentials.
    • EC2
    • S3
    • Elastic MapReduce
When you do so, you should see that you already have access to each Web service. This is just to double-check.
AWS YouAlreadyHaveAccess.png

Locate your Amazon credentials

  • Go to Account/Security Credentials

AWS SecurityCredentials.png

Access Key, Secret Key

  • Locate your access key information, which consists of an Access Key ID and a Secret Key. You will need both later to set up various tools.
AWS AccessSecretKey.png

The next step is to create a bucket in S3 and store Ulysses in it. The easiest way to do this is to use Firefox with the S3 Add-On installed.

3rd-Party Software Tools


Option 1: Firefox add-on for S3

  • Launch the Add-On and click on Manage Account, and then enter the information required. In Account Name enter a name based on the image below. You will need to enter the Access Key and Secret Key from above.



Option 2: Using the S3Tools

S3tools.org provides an open-source package for accessing S3 from the command line. It is a nice alternative to the Firefox Add-On, especially if you want to automate file uploads, downloads, or removals with shell scripts.

  • Download the S3Tools from s3tools.org and follow the directions for installation from the Web site.
  • Once installed, configure it from the command line:
 s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key: XXXXXXXXXXXXXXXXXXXX          (enter your access key here)
Secret Key: SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS        (enter your secret key here)

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: PPPPPPPPPPP                 (enter a password of your choice here)
Path to GPG program [/opt/local/bin/gpg]:  

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP and can't be used if you're behind a proxy
Use HTTPS protocol [No]: 

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't conect to S3 directly 
HTTP Proxy server name: 

New settings:
 Encryption password: PPPPPPPPPPP            
 Path to GPG program: /opt/local/bin/gpg
 Use HTTPS protocol: False
 HTTP Proxy server name: 
 HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] 
Please wait...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Success. Encryption and decryption worked fine :-)

Save settings? [y/N] y
Configuration saved to '/dddd/uuuuu/.s3cfg'

  • Create a new bucket
 s3cmd mb s3://352-abc
  • Upload a new file to your bucket
 s3cmd  put ulysses.txt  s3://352-abc/ulysses.txt
  • List the files in your bucket
 s3cmd ls s3://352-abc/
 2010-04-26 16:06    12345   s3://352-abc/ulysses.txt
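
The three s3cmd steps above can also be scripted. The following is a minimal sketch in Python, assuming s3cmd is installed and already configured. The function names `build_s3cmd_commands` and `run_commands` are hypothetical helpers introduced here for illustration; only the `mb`, `put`, and `ls` subcommands come from the steps above. By default the script only prints the commands (a dry run) and executes nothing.

```python
import os
import shlex
import subprocess


def build_s3cmd_commands(bucket, local_file):
    """Return the three s3cmd invocations shown above as argument lists.

    Hypothetical helper: the bucket and file names are supplied by the caller.
    """
    bucket_url = f"s3://{bucket}"
    # Store the file under its base name, as in the example above.
    remote_url = f"{bucket_url}/{os.path.basename(local_file)}"
    return [
        ["s3cmd", "mb", bucket_url],               # create the bucket
        ["s3cmd", "put", local_file, remote_url],  # upload the file
        ["s3cmd", "ls", bucket_url + "/"],         # list the bucket contents
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the script at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_commands(build_s3cmd_commands("352-abc", "ulysses.txt"))
```

Passing `dry_run=False` to `run_commands` would actually create the bucket and upload the file, so review the printed commands first.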

Upload Ulysses to Amazon S3

In this section we will use the Firefox S3 Add-On.

  • If you do not have Ulysses handy, download it from gutenberg.org:
wget http://www.gutenberg.org/files/4300/4300.zip
unzip 4300.zip
rm 4300.zip
cat 4300.txt | head -50
  • Make sure your S3 Firefox GUI Add-On is open.
  • Create a new folder (Amazon calls it a bucket) in your S3 space. Bucket names live in a single space shared by all AWS users, so the name you pick must be unique. Something like 352-dft or 352-sharon should work.

AWS CreateBucket.png

  • Create three new sub-folders in your new folder:
    • data
    • output
    • logs

AWS dataLogsOutputFolders.png

  • In the left window, locate your text version of Ulysses and upload it to your new data folder on S3.

AWS UlyssesUploadedToData.png
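
The bucket layout built in the steps above can be summarized in a few lines of Python. This is only a sketch: `is_plausible_bucket_name` and `layout` are hypothetical helpers, and the pattern checks a conservative subset of Amazon's naming rules (3 to 63 characters, lowercase letters, digits, and hyphens, starting and ending with a letter or digit). A name that passes is well-formed, but that does not guarantee it is still available in the shared namespace.

```python
import re

# Assumption: a conservative subset of the S3 bucket naming rules.
BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name):
    """Return True if the name is well-formed under the subset above."""
    return BUCKET_NAME.fullmatch(name) is not None


def layout(bucket, input_file):
    """Return the S3 locations for the data, output, and logs folders."""
    return {
        "input": f"s3://{bucket}/data/{input_file}",
        "output": f"s3://{bucket}/output/",
        "logs": f"s3://{bucket}/logs/",
    }


if __name__ == "__main__":
    bucket = "352-dft"
    if is_plausible_bucket_name(bucket):
        for role, url in layout(bucket, "4300.txt").items():
            print(role, url)
```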

Counting word frequency different ways

We will process Ulysses using different approaches, going from the simplest to the most sophisticated.

  • Part 1: we use Amazon's own wordSplit program to process Ulysses.
  • Part 2: we create our own version of the WordSplit program in Python and upload it to S3.
  • Part 3: we create our own Java version of WordCount, compile it, upload it to S3, and run it there.
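
Before running anything on AWS, it can help to see what all three parts compute. The sketch below counts word frequencies locally in plain Python, split into a map step and a reduce step. It is not Amazon's wordSplit program or any of the versions used in Parts 1 to 3; the function names and the definition of a word (a run of lowercase letters) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
WORD = re.compile(r"[a-z]+")


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in WORD.findall(line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_frequencies(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["the cat and the hat", "The end"]
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

On a real cluster the map and reduce steps run on different machines and the framework groups the pairs by word between the two steps; here a single `Counter` plays that role.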

How Much Have I spent?

  • Please check your remaining credit regularly. Amazon generously granted each student $100 of access time to its AWS services.
  • To do so:
    • Connect to AWS Management Console
    • Select Account
    • Select Account Activity
    • Look up the available credit:

AWS AvailableCredit.png