Setup Virtual Hadoop Cluster under Ubuntu with VirtualBox

From DftWiki

Revision as of 14:28, 6 January 2017 by Thiebaut (Talk | contribs)


--D. Thiebaut (talk) 14:35, 22 June 2013 (EDT)



[Image: HadoopRecipeDFT.jpg]


This tutorial is a collection of recipes gleaned from the Web and put together to record how a cluster of virtual servers located on the same physical machine was assembled into a Hadoop cluster for a classroom environment. The information on this Web page is provided for informational purposes only, and no implied warranty is associated with it. These steps will help you set up a virtual cluster to the best of our knowledge, but some steps may not work for you as illustrated here. In such cases, please take the time to send an email to dthiebaut@smith.edu to help maintain the quality of the information provided. Thanks!

Server Hardware


Here are the specs of our main machine:

  • ASUS M5A97 R2.0 AM3+ AMD 970 SATA 6Gb/s USB 3.0 ATX AMD Motherboard ($84.99)
  • Corsair Vengeance 16GB (2x8GB) DDR3 1600 MHz (PC3 12800) Desktop Memory (CMZ16GX3M2A1600C10) ($120.22)
  • AMD FX-8150 8-Core Black Edition Processor Socket AM3+ FD8150FRGUBOX ($175)
  • 2x 1 TB-RAID 1 disks
    • pressed Ctrl-F during POST to enter the RAID setup utility
    • configured the two 1 TB drives as RAID-1



Software Setup

Ubuntu Desktop V13.04 is the main operating system of the computer. VirtualBox is used to set up the virtual Hadoop servers. Since the main machine has an 8-core processor, we create a 6-host virtual cluster.

While installing Ubuntu, use the Ubuntu Software Manager GUI to install the following packages:

  • Apache2+php (to eventually run virtualboxphp)
 sudo apt-get install apache2
 sudo apt-get install php5 libapache2-mod-php5
 sudo /etc/init.d/apache2 restart

  • VirtualBox
  • emacs
  • eclipse
  • python 3.1
  • ddclient (for dynamic hostname)
  • openssh-server (for remote access)
  • set up the Ctrl/Caps-Lock keyboard swap
  • set up two video displays
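For reference, the same packages can also be installed from a terminal in one pass. This is only a sketch: the package names (notably python3, eclipse, and openssh-server) are our assumptions for Ubuntu 13.04's repositories, so verify them with apt-cache search first.

```shell
#!/bin/bash
# Sketch: install the packages listed above from the command line.
# Package names are assumptions for Ubuntu 13.04; verify before running.
PACKAGES="apache2 php5 libapache2-mod-php5 virtualbox emacs eclipse python3 ddclient openssh-server"

# Build the install command; review it, then run it (or pipe it to sh).
apt_cmd() { echo "sudo apt-get install -y $*"; }
apt_cmd $PACKAGES
```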



Create The First Virtual Server



Bo Feng @ http://b-feng.blogspot.com/2013/02/create-virtual-ubuntu-cluster-on.html describes the setup of a virtual cluster as a series of simple (though high-level) steps:

  • Create a virtual machine
    • Change the network to bridge mode
  • Install Ubuntu Server OS
    • Setup the hostname to something like "node-01"
  • Boot up the new virtual machine
    • sudo rm -rf /etc/udev/rules.d/70-persistent-net.rules
  • Clone the new virtual machine
    • Initialize with a new MAC address
  • Boot up the cloned virtual machine
    • edit /etc/hostname
      • change "node-01" to "node-02"
    • edit /etc/hosts
      • change "node-01" to "node-02"
  • Reboot the cloned virtual machine
  • Redo step 4 to 6 until there are enough nodes
Note: the IP address of each node should start with "192.168." (unless you are not behind a home or lab router).
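The clone-and-reboot loop above can also be scripted with VirtualBox's VBoxManage CLI instead of the GUI. A minimal sketch, assuming the template VM is named node-01 as in the list:

```shell
#!/bin/bash
# Sketch of the cloning steps above using the VBoxManage CLI.
# VM names (node-01 as the template) are assumptions taken from the list.
node_name() { printf "node-%02d" "$1"; }    # e.g. node_name 3 -> node-03

clone_node() {
  # clonevm generates fresh MAC addresses by default, which is what we want
  VBoxManage clonevm "$(node_name 1)" --name "$(node_name "$1")" --register
}

# Uncomment to create node-02 .. node-06:
# for i in 2 3 4 5 6; do clone_node "$i"; done
```

Editing /etc/hostname and /etc/hosts inside each clone still has to be done per machine, as described above.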

That's basically it, but we'll go through the fine details here.

Create Virtual Server

This description is based on http://serc.carleton.edu/csinparallel/vm_cluster_macalester.html.

  • On the Ubuntu physical machine, download ubuntu-13.04-server-amd64.iso from Ubuntu repository
  • Mount the image so that the virtual machines can install directly from it: open the file manager, go to Downloads, right-click (a 3-button mouse helps) on ubuntu-13.04-server-amd64.iso and pick Open with Archive Mounter
  • Start VirtualBox (should be in Applications/Accessories)
  • Create new virtual box with these attributes:
    • Linux
    • Ubuntu 64 bits
    • 2 GB virtual RAM
    • Create virtual hard disk, VDI
    • dynamically allocated
    • set the name to hadoop100, with 8.00 GB drive size
[Image: VirtualHadoopCluster_VirtualBoxCreateHadoop100.png]
    • install Ubuntu (pick language, keyboard, time zone, etc...)
    • use a superuser name and password
    • partition the disk using the guided feature. Select LVM
    • install security updates automatically
    • select the OpenSSH server (leave all others unselected)
  • Start the Virtual Machine Hadoop100. It works!


Setup Network


  • Using VirtualBox, stop hadoop100
  • Right click on the machine and in the settings, go to Network
  • select Enable Network Adapter
  • select Bridged Adapter and pick eth0.
  • Using VirtualBox, restart hadoop100
  • from its terminal window, ping google.com and verify that the site is reachable from the virtual server.
  • (get the machine's ip address using ifconfig -a if necessary)
  • Change server name by editing /etc/hostname and set to hadoop100
  • similarly, edit /etc/hosts to contain line 127.0.1.1 hadoop100


Setup Dynamic-Host Address


On another computer, connect to DynDNS and register hadoop100 with this service.

  • on hadoop100, install ddclient
   sudo apt-get install ddclient
  • and enter the full address of the server (e.g. hadoop100.webhop.net)
  • If you need to modify the information collected when ddclient was installed, it can be found in /etc/ddclient.conf
  • wait a few minutes and verify that hadoop100 is now reachable via ssh from a remote computer.


Install Hadoop


  • Install java
sudo apt-get install openjdk-6-jdk
java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.5) (6b27-1.12.5-1ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

  • Disable ipv6 (recommended by many users setting up Hadoop)
 sudo emacs -nw /etc/sysctl.conf 
and add these lines at the end:
  #disable ipv6  
  net.ipv6.conf.all.disable_ipv6 = 1  
  net.ipv6.conf.default.disable_ipv6 = 1  
  net.ipv6.conf.lo.disable_ipv6 = 1  
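The same edit can be made non-interactively. A sketch that appends the three settings to whatever file is passed as an argument (run it against /etc/sysctl.conf as root, then run sysctl -p or reboot):

```shell
#!/bin/bash
# Sketch: append the IPv6-disable settings shown above to a sysctl file.
append_sysctl() {
  cat >> "$1" <<'EOF'
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
}

# As root:  append_sysctl /etc/sysctl.conf && sysctl -p
# Check:    cat /proc/sys/net/ipv6/conf/all/disable_ipv6   (should print 1)
```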

  • reboot hadoop100
  • Now follow the directives in Michael Noll's excellent tutorial for building a 1-node Hadoop cluster (cached copy).
  • create hduser user and hadoop group (already existed)
  • create ssh password-less entry
  • open /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME
  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/

  • Download hadoop from hadoop repository
  sudo -i
  cd /usr/local
  wget http://www.poolsaboveground.com/apache/hadoop/common/stable/hadoop-1.1.2.tar.gz
  tar -xzf hadoop-1.1.2.tar.gz
  mv hadoop-1.1.2 hadoop
  chown -R hduser:hadoop hadoop
  • set up user hduser by logging in as hduser
  su - hduser
  emacs .bashrc
and add
 # HADOOP SETTINGS
 # Set Hadoop-related environment variables
 export HADOOP_HOME=/usr/local/hadoop

 # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
 #export JAVA_HOME=/usr/lib/jvm/java-6-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

 # Some convenient aliases and functions for running Hadoop-related commands
 unalias fs &> /dev/null
 alias fs="hadoop fs"
 unalias hls &> /dev/null
 alias hls="fs -ls"

 export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin"
 export PATH="$PATH:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:."
This setting of .bashrc should be done by any user wanting to use hadoop.
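Since that .bashrc block must be added for every Hadoop user, a small idempotent helper avoids pasting it by hand and keeps it from being appended twice. A sketch (the marker string is an assumption; any line unique to the block works):

```shell
#!/bin/bash
# Sketch: append a block to a user's .bashrc exactly once.
append_once() {
  local marker="$1" file="$2"
  grep -qF "$marker" "$file" 2>/dev/null && return 0   # already present
  cat >> "$file"                                       # block comes from stdin
}

# Usage:
#   append_once "# HADOOP SETTINGS" ~/.bashrc <<'EOF'
#   # HADOOP SETTINGS
#   export HADOOP_HOME=/usr/local/hadoop
#   ...
#   EOF
```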
  • edit /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME
 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
  • Edit /usr/local/hadoop/conf/core-site.xml and add
  <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>
 
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
  • Edit /usr/local/hadoop/conf/mapred-site.xml and add
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
    • Edit /usr/local/hadoop/conf/hdfs-site.xml and add
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
  • Format namenode. As user hduser enter the command:
  hadoop namenode -format
  • as super user run the following commands:
  sudo mkdir -p /app/hadoop/tmp
  sudo chown hduser:hadoop /app/hadoop/tmp
  sudo chmod 750 /app/hadoop/tmp

  sudo ln -s   /usr/local/hadoop/hadoop-examples-1.1.2.jar /usr/local/hadoop/hadoop-examples.jar  
  • back as hduser
  cd
  mkdir input
  cp /usr/local/hadoop/conf/*.xml input
  start-all.sh
  hadoop dfs -copyFromLocal input/ input
  hadoop dfs -ls
  Found 1 item
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:16 /user/hduser/input
  hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /user/hduser/input/ output
  13/07/24 11:24:46 INFO input.FileInputFormat: Total input paths to process : 7
  13/07/24 11:24:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
  13/07/24 11:24:46 WARN snappy.LoadSnappy: Snappy native library not loaded
  13/07/24 11:24:47 INFO mapred.JobClient: Running job: job_201307241105_0006
  13/07/24 11:24:48 INFO mapred.JobClient:  map 0% reduce 0%
  13/07/24 11:24:57 INFO mapred.JobClient:  map 28% reduce 0%
  13/07/24 11:25:04 INFO mapred.JobClient:  map 42% reduce 0%
  .
  .
  .
  13/07/24 11:25:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1448595456
  13/07/24 11:25:22 INFO mapred.JobClient:     Reduce output records=805
  13/07/24 11:25:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5797994496
  13/07/24 11:25:22 INFO mapred.JobClient:     Map output records=3114

  hadoop dfs -ls 
  Found 2 items
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:16 /user/hduser/input
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:29 /user/hduser/output
 
  hadoop dfs -ls output
  Found 3 items
  -rw-r--r--   1 hduser supergroup          0 2013-07-24 11:29 /user/hduser/output/_SUCCESS
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:28 /user/hduser/output/_logs
  -rw-r--r--   1 hduser supergroup      15477 2013-07-24 11:28 /user/hduser/output/part-r-00000

  hadoop dfs -cat output/part-r-00000
  "*"	11
  "_HOST"	1
  "alice,bob	11
  "local"	2
  'HTTP/'	1
  (-1).	2
  (1/4	1
  (maximum-system-jobs	2
  *not*	1
  ,	2
  ...




All set!


Add New User who will run Hadoop Jobs



Give the Right Permissions to New User for Hadoop


This is taken and adapted from http://amalgjose.wordpress.com/2013/02/09/setting-up-multiple-users-in-hadoop-clusters/

  • as superuser, add new user mickey to Ubuntu
  sudo useradd  -d /home/mickey -m mickey
  sudo passwd mickey
  • login as hduser
 hadoop fs -chmod -R  1777 /app/hadoop/tmp/mapred/   (this generates errors...)
 chmod 777 /app/hadoop/tmp
 hadoop fs -mkdir /user/mickey/
 hadoop fs -chown -R mickey:mickey /user/mickey/
  • login as mickey
  • change the default shell to bash
  chsh -s /bin/bash
  • emacs .bashrc and add these lines at the end:
 # HADOOP SETTINGS
 # Set Hadoop-related environment variables
 #export HADOOP_HOME=/usr/local/hadoop
 
 # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
 #export JAVA_HOME=/usr/lib/jvm/java-6-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
 
 # Some convenient aliases and functions for running Hadoop-related commands
 unalias fs &> /dev/null
 alias fs="hadoop fs"
 unalias hls &> /dev/null
 alias hls="fs -ls"

 export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin"
 export PATH="$PATH:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:."

  • log out and back in as mickey


Test


  • create input directory and copy some random text files there.
  mkdir input
  cp /usr/local/hadoop/conf/*.xml input
  hadoop dfs -copyFromLocal input/ input
  • run hadoop wordcount job on the files in the new directory to test the setup.
  hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /user/hduser/input/ output
  hadoop dfs -ls
  hadoop dfs -ls output
  hadoop dfs -cat output/part-r-00000
  "*"	11
  "_HOST"	1
  "alice,bob	11
  "local"	2
  'HTTP/'	1
  (-1).	2
  (1/4	1
  ...
  • It should work correctly and create a list of words and their frequency of occurrence.




Clone and Create a New Virtual Server


Clone and Initialize

We've found that cloning a server is not always a reliable way to copy all the users and applications that have been added to the initial Ubuntu server. Instead, exporting the first server as an appliance and importing it back as a new virtual server works well.

This work is done on the physical server, using the VirtualBox Manager. We'll use hadoop100 as the master for cloning.

  • (deprecated) Before closing hadoop100 down for cloning, connect to it via ssh and remove the file /etc/udev/rules.d/70-persistent-net.rules (it will be recreated automatically; there is no need to save it under a different name).
  sudo rm -f /etc/udev/rules.d/70-persistent-net.rules

Not doing this will prevent the network interface from being brought up.
  • Close hadoop100 and power it off.
  • Click on the machine hadoop100 in VBox manager. Select Export Appliance in the File menu. Accept the default options and export it.
  • In the File menu, click on Import Appliance, and call it hadoop101. Give it 2 GB of RAM.
  • Do not check the box labeled "Reinitialize the MAC address of all network cards"
  • right click on new machine, pick settings and verify/adjust these quantities:
    • System: select 2GB RAM
    • Processor: 1CPU
    • Network: bridged adapter
  • start the newly cloned virtual machine (hadoop101)
  • it may complain that it cannot find the network, in which case log in as the superuser (the same as for hadoop100) and do this:
    1. Run this command in the terminal:
      ifconfig -a
    2. Note the id of the first Ethernet adapter. It should be of the form "eth?"
    3. edit the file /etc/network/interfaces and change the adapter name (it appears twice) from "eth0" to the id found in the previous step
    4. Save the file
    5. reboot
  • Reboot the virtual appliance if you have skipped the previous step, and connect to it in the terminal window provided by VirtualBox:
  • replace hadoop100 by hadoop101 in several files:
 sudo perl -pi.back -e 's/hadoop100/hadoop101/g;' /etc/hostname
 sudo perl -pi.back -e 's/hadoop100/hadoop101/g;' /etc/hosts
 sudo perl -pi.back -e 's/hadoop100/hadoop101/g;' /etc/ddclient.conf
  • reboot to make the changes take effect:
 reboot
  • You should be able to ssh to the new machine remotely from another computer, now that ddclient is set up. You may have to wait a few minutes, though, for the dynamic host name to propagate on the Internet... 5 minutes or so.
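The three perl one-liners above generalize naturally into a helper that renames a clone in one call. A sketch using sed instead of perl (same backup-file behavior; the file list is the one used above, so extend it if your setup mentions the hostname elsewhere):

```shell
#!/bin/bash
# Sketch: rename a cloned node by rewriting every file that mentions
# the old hostname. Run as root on the clone, then reboot.
rename_node() {
  local old="$1" new="$2"; shift 2
  for f in "$@"; do
    [ -f "$f" ] && sed -i.back "s/${old}/${new}/g" "$f"  # keeps a .back copy
  done
}

# Usage (as root):
#   rename_node hadoop100 hadoop101 /etc/hostname /etc/hosts /etc/ddclient.conf
```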


Making the Clone run as a Hadoop 1-Node Cluster


In this section we make sure the new virtual server can work as a 1-node Hadoop cluster, and we repeat the steps from above.

The clone should contain all the hadoop setup we did for hadoop100. We'll just clean it up a bit.

  • run these commands in a terminal (you should be able to ssh to your new clone by now). We assume that the new clone is named hadoop101.
    sudo perl -pi.back -e 's/hadoop100/hadoop101/g;'  /usr/local/hadoop/conf/hdfs-site.xml
   su - hduser
   stop-all.sh        (nothing should be running, but just to be safe...)
   cd /app/hadoop/tmp/
   rm -rf  *          (remove the temp directory for hadoop.  It will recreate it on its own)
   hadoop namenode -format      (reformat the name node)

   13/07/25 12:31:53 INFO namenode.NameNode: STARTUP_MSG: 
   /************************************************************
   STARTUP_MSG: Starting NameNode
   STARTUP_MSG:   host = hadoop101/127.0.1.1
   .
   .
   .
   13/07/25 12:31:54 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
   13/07/25 12:31:54 INFO namenode.NameNode: SHUTDOWN_MSG: 
   /************************************************************
   SHUTDOWN_MSG: Shutting down NameNode at hadoop101/127.0.1.1
   ************************************************************/
   
   start-all.sh
  • The input folder should already exist in the hduser home directory, so we need to copy it again to the hadoop file-system:
   hadoop dfs -copyFromLocal input/ input
  • Now we should be able to run our wordcount test:
   hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /user/hduser/input/ output
  13/07/25 12:40:56 INFO input.FileInputFormat: Total input paths to process : 7
  13/07/25 12:40:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library
  13/07/25 12:40:56 WARN snappy.LoadSnappy: Snappy native library not loaded
  13/07/25 12:40:57 INFO mapred.JobClient: Running job: job_201307251237_0003
  .
  .
  .
  13/07/25 12:42:20 INFO mapred.JobClient:     Reduce output records=805
  13/07/25 12:42:20 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6065721344
  13/07/25 12:42:20 INFO mapred.JobClient:     Map output records=3114

  • Verify that we have a result file (feel free to check its contents):
  hadoop dfs -ls output
  Found 3 items
  -rw-r--r--   1 hduser supergroup          0 2013-07-25 12:42 /user/hduser/output/_SUCCESS
  drwxr-xr-x   - hduser supergroup          0 2013-07-25 12:40 /user/hduser/output/_logs
  -rw-r--r--   1 hduser supergroup      15477 2013-07-25 12:42 /user/hduser/output/part-r-00000
  • This verifies that our newly cloned server can run hadoop jobs as a 1-node cluster.



Create a 2-Node Virtual Hadoop Cluster

  • Follow the excellent tutorial by Michael Noll on how to set up a 2-node cluster (cached copy). Simply use hadoop100 whenever Noll uses master and hadoop101 whenever he uses slave. The hadoop environment where files need to be modified is in /usr/local/hadoop, and NOT in /etc/hadoop.
  • Very important: comment out the second line of /etc/hosts on the master as follows:
 #127.0.1.1	hadoop100
as suggested in http://stackoverflow.com/questions/8872807/hadoop-datanodes-cannot-find-namenode; otherwise the slave cannot connect to the master, which binds to the wrong address/port when listening for the slave, generating entries of the form INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/... in the slave's log file.
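That /etc/hosts edit can also be applied non-interactively. A minimal sketch using sed (the .back backup suffix mirrors the perl commands used earlier):

```shell
#!/bin/bash
# Sketch: comment out the 127.0.1.1 line so the master does not bind
# the Hadoop daemons to the loopback address.
comment_loopback() {
  sed -i.back '/^127\.0\.1\.1/s/^/#/' "$1"   # keeps a .back copy
}

# On the master, as root:  comment_loopback /etc/hosts
```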
  • Cleanup the hadoop environment in both hadoop100 and hadoop101:
      su - hduser
      stop-all.sh          (just in case...)
      rm -rf /app/hadoop/tmp/* 

  • On hadoop100 (the master), recreate a clean namenode:
      hadoop namenode -format

  • Start the hadoop system on hadoop100 (the master):


      start-dfs.sh
      start-mapred.sh

  • Assuming that the input folder with dummy files is still in the hduser directory, copy it to the hadoop file system:
      cd
      hadoop dfs -copyFromLocal input/ input

  • Run the test:
    hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /user/hduser/input/ output
    13/07/25 16:04:54 INFO input.FileInputFormat: Total input paths to process : 7
    13/07/25 16:04:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    13/07/25 16:04:54 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/07/25 16:04:55 INFO mapred.JobClient: Running job: job_201307251604_0001
    13/07/25 16:04:56 INFO mapred.JobClient:  map 0% reduce 0%
    13/07/25 16:05:06 INFO mapred.JobClient:  map 28% reduce 0%
    .
    .
    .
    13/07/25 16:06:34 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6066085888
    13/07/25 16:06:34 INFO mapred.JobClient:     Map output records=3114



Food for Thought

  • there should be an easy way to create a script to back up the virtual servers automatically, say via cron jobs. The discussion here is a good place to start...
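As a starting point, here is a rough sketch built on VBoxManage export, the CLI counterpart of the Export Appliance step used earlier. The VM names, backup directory, shutdown delay, and cron schedule are all assumptions:

```shell
#!/bin/bash
# Sketch: nightly export of the virtual servers via cron.
# VM names, backup dir, and timings below are assumptions; adjust them.
backup_name() { echo "$1-$(date +%Y-%m-%d).ova"; }   # e.g. hadoop100-2013-07-25.ova

backup_vm() {
  local vm="$1" dir="$2"
  VBoxManage controlvm "$vm" acpipowerbutton || true  # ask the VM to shut down
  sleep 60                                            # give it time to halt
  VBoxManage export "$vm" -o "$dir/$(backup_name "$vm")"
}

# Put the functions in a script (say /usr/local/bin/backup_vms.sh) that loops
# over the VMs, and run it nightly from the VirtualBox user's crontab:
#   0 3 * * * /usr/local/bin/backup_vms.sh
```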