CSC352 Hadoop Cluster Howto

From CSclasswiki
Jump to: navigation, search

Contributing Members

This HOWTO was created by the members of the CSC352 Spring 2010 Parallel and Distributed Programming class.

  1. LeNguyen.jpg Le Nguyen
  2. RaeRecto.jpg Ellysha Raelen Recto
  3. Xiao Ting Zhao
  4. Yang Li
  5. Alex Cheng
  6. Gillian Riggs
  7. Margaret Zaccardi


  1. Select a partner and form a team
  2. Pick one of the PCs that is not already selected
  3. Put a marker on the PC indicating that it "belongs" to your team.
  4. Connect a screen, keyboard, mouse and Internet connection. You may want to move the PC to one of the spare lab tables while installing the software on it.
  5. Create a bootable Ubuntu Linux Workstation x86 CD (if you have done this before, skip this step). Note that Ubuntu CD is available on the cloud cluster.
  6. Install Ubuntu Linux Workstation and Hadoop 0.19.2 on "your" PC. Follow, edit, improve the set of instructions shown below. Add your name at the top of the page when you start editing the page. Feel free to add pictures, screen captures, images, new pages if necessary. (Note: if you add a new page, make sure the word CSC352 is the first word in the name of the page.)
  7. Time frame: until the last day of class before Spring Break.

Workstation Setup


This is a quick list of the actions that are needed... Individual teams should edit this page to make it fool-proof for anybody to setup such a cluster...

  • From Rae and Le's experience, it will take approximately 3 hours to complete until Hadoop installation.
  1. The workstations are Dell Optiplex 745 or GX620.
  2. Connect all the cables (video, keyboard, mouse, power, ethernet) and power up
    • Monitor needs a molex adaptor that is currently connected to the cloud cluster
    • Keyboard and mouse are both located under the cloud cluster
    • The monitor and tower takes a 300W power cord, two are connected to the cloud cluster
  3. Verify that the workstation boots as a Windows XP machine with access to the Smith network.
  4. Do not continue on until the previous step is successful!
  5. If you do not know how to make a disk-image of a Linux ISO distribution, please create a bootable desktop/workstation Ubuntu 9.10, 32-bit, X86 CD. A disk is also available near the router.
  6. Boot the machine up from the CD-Rom which contains a bootable Ubuntu 9.10 Workstation.
    • Use the following steps if you do not know how to boot from a CD:
      • 1. Restart the computer after a successful login.
      • 2. While rebooting, press DEL or F2 to enter Setup mode.
      • 3. Scroll up and down using the arrows, until you find the option to change boot mode.
      • 4. Press enter to choose this mode.
      • 5. Use the U/D keys to move the "CD-ROM" option up to the top.
      • 6. Press enter to save, and then exit setup.
  7. Accept the following settings
    • Language: English
    • Time Zone: US Eastern
    • Keyboard Layout: US
  8. Partition screen shot using Optiplex 745
    When asked for the partition, keep Windows XP and Fedora (if on the disk) but reduce XP to 64GB, keep Fedora at 10GB and use the remaining space for Ubuntu.
    • Note: there is one machine that can't be resized because the disk is full (Yang and Xiao Ting put sticker on that machine).
    • When using Optiplex 745, select the option "Install them side by side, choosing between them each startup", then a slider should appear. This slider allows you to adjust XP to 64GB. Note that it did not allow Fedora to be adjusted.
  9. Go ahead and repartition/format the drive... This may take a long time... (it took about 7-10 minutes)
  10. Identity:
    • name: hadoop
    • login: hadoop
    • password: (ask DT)
    • machine name: hadoop1 (or hadoop2, hadoop3, etc.)
    • Rae and Le took machine name hadoop2.
    • Xiao Ting and Yang took machine name hadoop5.
    • Alex and Gillian took machine name hadoop4.
    • Diana and Tonje took machine name hadoop3.
    • Select "Require password to login"
  11. Restart the machine when prompted (For Diana and Tonje installation itself took 14min).
  12. Using the Synaptic package manager, install openssh.
    • If the package cannot be found, run sudo apt-get update in the command line. ProblemEncountered.jpg UpdatesDownload.jpg
    • To install openssh via command line, run sudo apt-get install openssh-server
  13. type ifconfig to get the computer's IP address.
  14. ssh to it from your favorite location to continue the installation...
    • You can do it via command line via ssh username@ipaddress and enter the password when prompted.
  15. download, setup and install Hadoop. Use hadoop-0.19.2 and not the most recent: The O'Reilly hadoop book and most of the documentation on the Web assumes 0.19.2, and not 0.20.
    • Make sure that java is installed and is version 1.6 by typing java -version at the shell prompt. Verify that you get Java Version 1.6xxx. If not, you can download Java here.
      • The version of Java downloaded (and it seemed to work) was "Linux (self-extracting file)". The instructions for how to install it are on the Java website.
    • A good reference guide to the installation process can be found here.
    • You can download hadoop here.
  16. Install emacs
    sudo apt-get install emacs
  17. Create a script called that will record Ip addresses to database (because Ips might change)
  18. make the script executable
    chmod +x bin/
  19. create a cron job to run the script at regular intervals
    crontab -e
    */10 * * * * /home/hadoop/bin/
  20. Install apache2 (either from the command line or from the syntaptic package manager)

Ip Addresses

This section is only visible to computers located at Smith College

Make sure the server syncs its time to a time server

  • create a crontab entry in the superuser's account:
  sudo -i                (enter hadoop users's password)
  crontab -e
enter this line (followed by Enter key)
  */10 * * * * /usr/sbin/ntpdate
save file.
  • Now the hadoop server will sync its time to every 10 minutes.

Setting Up Hadoop Cluster

  • There are 4 documents that provide a wealth of information for setting up a cluster:

Hadoop on Ubuntu

  • Another document came in handy when the Reduce phase was not working:
Hadoop on Ubuntu     

Software Prereqs

mkdir temp
cd temp
tar -xzvf hadoop-0.19.2.tar.gz 
mv hadoop-0.19.2 ..
ln -s hadoop-0.19.2 hadoop
  • Now you should have a hadoop directory in your main account. Verify that you can get to it:
cd hadoop
ls -1           (that's minus one)

  • Install java
sudo apt-get install sun-java6-jdk
accept the license agreement.
  • Verify that hadoop responds
cd hadoop
Usage: hadoop [--config confdir] COMMAND 
where COMMAND is one of:
If you get the Usage: information, you're good.

Configuring the Cluster

  • Most of the steps below should be done on hadoop1.

passwordless ssh

  • generate keys on hadoop1
ssh-keygen -t dsa -P ""  -f ~/.ssh/id_dsa
cat ~/.ssh/ >> ~/.ssh/authorized_keys
  • copy to hadoop2, 3, 4, and 5.
 ssh 131.229.nnn.nnn   (where you get these from

 cd .ssh
 rsync -av .
 rsync -av .

This section is only visible to computers located at Smith College

Edit /etc/hosts and Define Hadoop Machines

  • add the following lines to the /etc/hosts (in sudo mode) on each machine. Comment out the machine which is already defined as hadoop5 hadoop4 hadoop3  hadoop2 hadoop1    

Create conf/hadoop-site.xml

  • on Hadoop1, 2, 3, 4, 5 and 6 create and edit conf/hadoop-site.xml

Create conf/slaves file

  • the conf/slaves file should contain the IP addresses of all the hadoop machines except the hadoop1 machine (which is the master node):

Create conf/masters file

  • edit as follows.
This makes hadoop1 the secondary NameNode. The primary is also hadoop1.


  • edit ~/.bash_profile, and add this line
EXPORT HADOOP_HOME="/home/hadoop/hadoop"
  • source it!
source ~/.bash_profile

Create Directory Support on NameNode (hadoop1)

 mkdir -p /home/hadoop/dfs/name

Create tmp directory for Data Nodes

  • on hadoop2, 3, 4, 5 and 6:
mkdir -p /tmp/hadoop
mkdir -p /home/hadoop/dfs/data

Format HDFS

  • on hadoop1
hadoop namenode -format
10/03/15 22:19:54 INFO namenode.NameNode: STARTUP_MSG: 
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop1/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.2
STARTUP_MSG:   build = \
                                    -r 789657; compiled by  'root' on Tue Jun 30 12:40:50 EDT 2009
Re-format filesystem in /home/hadoop/dfs/name ? (Y or N) Y
10/03/15 22:20:04 INFO namenode.FSNamesystem:      \
10/03/15 22:20:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/03/15 22:20:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/03/15 22:20:04 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/03/15 22:20:04 INFO common.Storage: Storage directory /home/hadoop/dfs/name has been successfully formatted.
10/03/15 22:20:04 INFO namenode.NameNode: SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down NameNode at hadoop1/


  • on hadoop1, edit hadoop/conf/ and set the JAVA_HOME line:
# The java implementation to use.  Required.
#export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
  • replicate it on all the other machines:
 for i in 103.204 100.208 103.202 99.216 ; do  
 rsync  131.229.$i:hadoop/conf/

Start the Cluster

  • 1) start the DFS 
starting namenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop4.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop3.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop2.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop5.out starting secondarynamenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-
  • 2) start task trackers
starting namenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop1.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop4.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop2.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop3.out starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop5.out starting secondarynamenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-
starting jobtracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-jobtracker-hadoop1.out starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop4.out starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop3.out starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop2.out starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-hadoop5.out

Test The Cluster

  • To see if the expected processes are running:
    • On hadoop1
16404 NameNode
16775 Jps
16576 SecondaryNameNode
16648 JobTracker
    • On hadoop2, 3, 4, or 5
14718 Jps
14532 DataNode
14655 TaskTracker
  • To see if Hadoop is listening to the right ports:
sudo netstat -plten | grep java
tcp6       0      0 :::47489                :::*                    LISTEN      1000       88851       16648/java      
tcp6       0      0          :::*                    LISTEN      1000       88356       16404/java      
tcp6       0      0          :::*                    LISTEN      1000       88854       16648/java      
tcp6       0      0 :::50090                :::*                    LISTEN      1000       89183       16576/java      
tcp6       0      0 :::50030                :::*                    LISTEN      1000       89121       16648/java      
tcp6       0      0 :::43893                :::*                    LISTEN      1000       88740       16576/java      
tcp6       0      0 :::51957                :::*                    LISTEN      1000       88353       16404/java      
tcp6       0      0 :::50070                :::*                    LISTEN      1000       88905       16404/java

Run Map-Reduce WordCount Program

  • Login to hadoop1 and create a directory for yourself:
mkdir dft          (use your own initials!)
cd dft
  • Download a copy of James Joyce's Ulysses:
cat 4300.txt | head -50
Verify that "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed."
  • Copy the data to the Hadoop File System (HDFS):
hadoop dfs -copyFromLocal /home/hadoop/dft dft 
hadoop dfs -ls
Found x items
drwxr-xr-x   - hadoop supergroup          0 2010-03-16 11:36 /user/hadoop/dft
The directory is there. Check its contents:
hadoop dfs -ls dft
Found 1 items
-rw-r--r--   2 hadoop supergroup    1573044 2010-03-16 11:36 /user/hadoop/dft/4300.txt
  • Run the wordcount java program from the example directory in hadoop:
hadoop jar ../hadoop/hadoop-0.19.2-examples.jar wordcount dft dft-output
10/03/16 11:40:51 INFO mapred.FileInputFormat: Total input paths to process : 1
10/03/16 11:40:51 INFO mapred.JobClient: Running job: job_201003161102_0002
10/03/16 11:40:52 INFO mapred.JobClient:  map 0% reduce 0%
10/03/16 11:40:55 INFO mapred.JobClient:  map 9% reduce 0%
10/03/16 11:40:56 INFO mapred.JobClient:  map 27% reduce 0%
10/03/16 11:40:58 INFO mapred.JobClient:  map 45% reduce 0%
10/03/16 11:40:59 INFO mapred.JobClient:  map 81% reduce 0%
10/03/16 11:41:01 INFO mapred.JobClient:  map 100% reduce 0%
10/03/16 11:41:09 INFO mapred.JobClient: Job complete: job_201003161102_0002
10/03/16 11:41:09 INFO mapred.JobClient: Counters: 17
10/03/16 11:41:09 INFO mapred.JobClient:   File Systems
10/03/16 11:41:09 INFO mapred.JobClient:     HDFS bytes read=1576605
10/03/16 11:41:09 INFO mapred.JobClient:     HDFS bytes written=527522
10/03/16 11:41:09 INFO mapred.JobClient:     Local bytes read=1219522
10/03/16 11:41:09 INFO mapred.JobClient:     Local bytes written=2439412
10/03/16 11:41:09 INFO mapred.JobClient:   Job Counters 
10/03/16 11:41:09 INFO mapred.JobClient:     Launched reduce tasks=1
10/03/16 11:41:09 INFO mapred.JobClient:     Rack-local map tasks=6
10/03/16 11:41:09 INFO mapred.JobClient:     Launched map tasks=11
10/03/16 11:41:09 INFO mapred.JobClient:     Data-local map tasks=5
10/03/16 11:41:09 INFO mapred.JobClient:   Map-Reduce Framework
10/03/16 11:41:09 INFO mapred.JobClient:     Reduce input groups=50091
10/03/16 11:41:09 INFO mapred.JobClient:     Combine output records=88551
10/03/16 11:41:09 INFO mapred.JobClient:     Map input records=33055
10/03/16 11:41:09 INFO mapred.JobClient:     Reduce output records=50091
10/03/16 11:41:09 INFO mapred.JobClient:     Map output bytes=2601773
10/03/16 11:41:09 INFO mapred.JobClient:     Map input bytes=1573044
10/03/16 11:41:09 INFO mapred.JobClient:     Combine input records=267975
10/03/16 11:41:09 INFO mapred.JobClient:     Map output records=267975
10/03/16 11:41:09 INFO mapred.JobClient:     Reduce input records=88551
  • Check the output of the program
hadoop dfs -ls
Found x items
drwxr-xr-x   - hadoop supergroup          0 2010-03-16 11:36 /user/hadoop/dft
drwxr-xr-x   - hadoop supergroup          0 2010-03-16 11:41 /user/hadoop/dft-output
hadoop dfs -ls dft-output
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2010-03-16 11:40 /user/hadoop/dft-output/_logs
-rw-r--r--   2 hadoop supergroup     527522 2010-03-16 11:41 /user/hadoop/dft-output/part-00000
  • Look at the output
hadoop dfs -cat dft-output/part-00000 | less 

"Come   1
"Defects,"      1
"I      1
"Information    1
"J"     1
"Plain  2
"Project        5
zest.   1
zigzag  2
zigzagging      1
zigzags,        1
zivio,  1
zmellz  1
zodiac  1
zodiac. 1
zodiacal        2
zoe)_   1
zones:  1
zoo.    1
zoological      1
zouave's        1
zrads,  2
zrads.  1
  • Remove the output directory (recursively going through directories if necessary):
hadoop dfs -rmr dft-output

Adding another DataNode

The hadoop cluster manages its data in a hierarchical fashion, where the master node (hadoop1) is the namenode and the other slave nodes are datanodes.

This section illustrates how hadoop6 ( was added to a working cluster of hadoop1, 2, 3, 4 and 5.

  • install Sun Java 6 on hadoop6 as illustrated in a section above.
  • copy the .bash_profile file from hadoop5 to hadoop6.
  • copy the whole hadoop-0.19.2 from hadoop5 to hadoop6. On hadoop6, do
 rsync -av  .
 ln -s hadoop-0.19.2 hadoop
  • Tell hadoop1 that it has a new host in its network
    • Edit /etc/hosts and add a new entry hadoop6
    • restart the network on hadoop1 so that it reads its /etc/hosts file again:
 sudo /etc/init.d/networking restart
  • make sure Hadoop is not running. On hadoop1:
  • The easiest way to add a new node is to first remove the complete hadoop file system, add the node, and rebuild the file system. On hadoop1, remove the hadoop file system (and, unfortunately, all the data it contains; in our case the various books including Ulysses.txt):
  cd dfs
  rm -rf  * rm -rf /home/hadoop/dfs
  • reformat the hadoop file system. On hadoop1:
  hadoop namenode -format         (then enter Y to accept)
  • restart Hadoop
  • check that the TaskTrackers and the DataNodes are running on the slaves, and that the NameNode, JobTracker, SecondaryNameNode are running on the master (hadoop1).
  • create a dft (use your initials) directory in the hadoop file system
  hadoop dfs -mkdir dft
  • copy ulysses.txt from where ever it is located to the hdfs:
  hadoop dfs -copyFromLocal ulysses.txt dft
  • run the wordcount program:
  hadoop jar /home/hadoop/hadoop/hadoop-0.19.2-examples.jar wordcount dft dft-output

Rudimentary Backup

  • Created backup dir in ~hadoop on all nodes:
cd ; mkdir .backup    (on hadoop1) mkdir .backup    (to do it on all the other nodes)
  • Make sure bin directory is on all the nodes (it should be): mkdir bin

  • Created in ~hadoop/bin on all the nodes:
#! /bin/bash

cd /home/hadoop
tar -czf /home/hadoop/.backup/backup.tgz --exclude='352/wikipages/*'   hadoop/*  352/*  bin/*
  • Copy backup script from hadoop1 to all the slaves:
rsync -av bin/  hadoop2:bin/
rsync -av bin/  hadoop3:bin/
rsync -av bin/  hadoop4:bin/
rsync -av bin/  hadoop5:bin/
rsync -av bin/  hadoop6:bin/

  • Create a cron job on hadoop1 to backup hadoop1 and all the slaves
crontab -l 

# m h  dom mon dow   command
*/10 * * * * /home/hadoop/bin/
0 3 * * * /home/hadoop/hadoop/bin/ /home/hadoop/bin/
0 3 * * * /bin/tar -czf /home/hadoop/.backup/backup.tgz --exclude='352/wikipages/*' \
              hadoop/* 352/*  bin/*hadoop@hadoop1:~/bin$ 

Some recipes in case of trouble...

Error: Cannot delete /user/hadoop/xxx-output. Name node is in safe mode

 hadoop dfsadmin -safemode leave

Error: need to reboot a node

  • Update the slaves file on hadoop1, ~/hadoop/conf/slaves, to reflect the new Ip.
  • Modify the ~/.bash_profile file and change the Ip of the node just rebooted
  • Modify the /etc/hosts file and enter the Ip of the node just recorded
 sudo -i      (and when prompted, enter the hadoop password)
 cd /etc
 emacs -nw hosts

Error: Need to reformat the Hadoop File System

  • Warning: The steps shown here will reformat the Hadoop File System. This means that any data file stored on the Hadoop file system by yourself of by somebody else in the class will be erased! rm -rf /home/hadoop/dfs/data
 rm -rf ~/dfs/
 hadoop namenode -format 
 hadoop dfs -ls
 hadoop dfs -mkdir dft
 hadoop dfs -ls
 cd dft
 hadoop  dfs -copyFromLocal ulysses.txt dft
 hadoop dfs -ls 
 hadoop jar /home/hadoop/dft/wordcount.jar org.myorg.WordCount dft dft-output
 hadoop dfs -ls
 hadoop dfs -ls dft-output
 hadoop dfs -cat dft-output/part-00000
  • Verify that the last steps lists the index of all the words in Ulysses.txt

Generating the Task Timelines yields very strange graphs

  • A quick fix: connect to all the machines and run
  sudo ntpdate
this should synchronize the machines in the cluster.

Error reporting cluster in "safe mode"

NoRouteToHostException in logs or in stderr output from a command

"Remote host identification has changed" error

Error message when you try to put files into the HDFS

HadoopStreaming jobs (such as the Python example that fetches web page titles) won't work.

Modifying the Partition of a Hadoop Node with GParted

--Thiebaut 14:24, 12 May 2010 (UTC)

Repartition with GParted

  • Get GParted Live on a CD
  • Boot the Hadoop node with GParted
    • Accept kbd
    • Accept language
    • Video option: use 1) for force video. Select 1024x768, Vesa, and 24-bit color depth.
    • Figure out where hadoop 1 is mounted.
      • For this trial, hadoop / is /dev/sda7 (71 GB), and swap is /dev/sda8 (3GB).
  • Remove all partitions except sda7 and sda8 (which get renamed sda5 and sda6).
  • Resize and move partitions left to use the whole disk
  • sda5 moved to left becomes 229 GB
  • sda6 moved to right becomes 3 GB
  • Accept changes and wait (2-3 hours)

Reinstall Grub

  • Still in GParted mode: Open terminal window
  sudo fdisk -l

      sda4 extended
      sda5 linux ext4
      sda6 swap

  sudo mount /dev/sda5 /mnt
  sudo grub-install --root-directory=/mnt /dev/sda
  • reboot
  • back in Hadoop user session:
  sudo update-grub
  • All set!


If you want to explore programming in the cloud, check these tutorials out: