Hadoop Tutorial 2.1 -- Streaming XML Files

--D. Thiebaut 02:52, 13 April 2010 (UTC)



This tutorial is a continuation of Tutorial 2. It uses Hadoop streaming to process XML files as a block: each Map task gets a whole XML file and breaks it down into tuples.

The Setup


The setup is simple:

  • We have a lot of XML files. The contents of each one are delimited by an <xml> tag at the beginning and an </xml> tag at the end.
  • One field of interest is sandwiched between <title> tags
  • The other field of interest is sandwiched between <text> tags.
  • The mapper function (mapper.py) gets the whole file as input (as opposed to the default of feeding only one line of the file at a time to the map function, as in the wordcount program of Tutorial 2).
  • The mapper outputs a tuple with the title string as key and a shortened version of the text string as value.
  • The reducer is the Identity function and outputs what it receives.

Input Files

  • Set yourself up in a new directory
  • Create a few files in the XML format described in The Setup section. Call them file1.xml, file2.xml and file3.xml.
  • Create a subdirectory in your HDFS directory. We use dft here as the user directory; replace it with your own name or initials.
  hadoop dfs -mkdir dft/xml
  • Copy the xml files from local storage to HDFS
  hadoop dfs -copyFromLocal file*.xml dft/xml
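
The three input files can be typed by hand or generated. Below is a minimal sketch of a generator, assuming each file holds one <xml> block with a <title> line and a <text> section, with each tag on its own line. The script name make_samples.py, the function names, and the titles and body text are all made up for illustration and are not part of the original tutorial.

```python
# make_samples.py -- hypothetical helper, not part of the original tutorial.
# Writes file1.xml, file2.xml and file3.xml in the layout described in
# The Setup: one <xml> block with a <title> line and a <text> section.

def make_page(title, body_lines):
    """Return the contents of one sample XML file as a string."""
    parts = ["<xml>", "<title>%s</title>" % title, "<text>"]
    parts.extend(body_lines)
    parts.extend(["</text>", "</xml>"])
    return "\n".join(parts) + "\n"

# Placeholder titles and text, invented for this example.
SAMPLES = [
    ("First page",  ["Some text for the first page.", "A second line."]),
    ("Second page", ["Some text for the second page."]),
    ("Third page",  ["Some text for the third page."]),
]

def write_samples():
    """Write the sample files to the current directory."""
    for number, (title, body) in enumerate(SAMPLES, 1):
        with open("file%d.xml" % number, "w") as handle:
            handle.write(make_page(title, body))

if __name__ == "__main__":
    write_samples()
```

Running the script in the working directory before the copyFromLocal step leaves three files that match the file*.xml pattern.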

The Mapper program

Create a mapper.py program in your working directory.

#!/usr/bin/env python
# mapper.py
# D. Thiebaut
# Takes a complete XML file, extracts the title and
# text parts of the file, and outputs a tuple
# <title \t text> where text is shortened to just a few
# characters.
import sys

lines = []          # lines collected between <text> and </text>
title = "Unknown"
inText = False
for line in sys.stdin:
    line = line.strip()
    if line.find( "<title>" ) != -1:
        # assumes <title>...</title> occupies the whole line
        title = line[ len( "<title>" ) : -len( "</title>" ) ]
    if line.find( "<text>" ) != -1:
        inText = True
    if line.find( "</text>" ) != -1:
        inText = False
    if inText:
        # the line holding the <text> tag is collected too
        lines.append( line )

text = ' '.join( lines )
# keep only the first and last 10 characters of the text
text = text[0:10] + "..." + text[-10:]
print( '[[%s]]\t[[%s]]' % (title, text) )

  • Make it executable
 chmod a+x mapper.py


  • Test the mapper on one of the xml files:
  cat file1.xml | ./mapper.py
Verify that you get the correct title and shortened text.
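
To check the parsing logic without the shell pipeline, the same steps can be wrapped in a function and fed a list of lines. The sketch below mirrors mapper.py as written. The function name map_page and the sample page are hypothetical.

```python
def map_page(lines):
    """Mirror mapper.py: return the '[[title]]<TAB>[[text]]' tuple for one file."""
    collected = []
    title = "Unknown"
    in_text = False
    for line in lines:
        line = line.strip()
        if "<title>" in line:
            title = line[len("<title>"):-len("</title>")]
        if "<text>" in line:
            in_text = True
        if "</text>" in line:
            in_text = False
        if in_text:
            collected.append(line)
    text = " ".join(collected)
    # Keep only the first and last 10 characters of the text.
    text = text[0:10] + "..." + text[-10:]
    return "[[%s]]\t[[%s]]" % (title, text)

# Made-up sample page, one tag per line.
sample = [
    "<xml>",
    "<title>Hello</title>",
    "<text>",
    "The quick brown fox jumps over the lazy dog.",
    "</text>",
    "</xml>",
]
print(map_page(sample))
# prints: [[Hello]]<TAB>[[<text> The... lazy dog.]]
```

Because the line holding the <text> tag is itself collected, the shortened text starts with the tag. Keep this in mind when deciding whether the mapper output is "correct" for your files.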

The Reducer program

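#!/usr/bin/env python
# reducer.py
# Identity reducer: outputs each <title \t text> tuple as received.
import sys

for line in sys.stdin:
    line = line.strip()
    # split on the first tab only, in case the text itself contains tabs
    title, page = line.split('\t', 1)
    print( '%s\t%s' % ( title, page ) )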

  • Create a file called reducer.py in your working directory with the code above.
  • Make it executable
  chmod a+x ./reducer.py


  • Test it by feeding it the output of the mapper:
  cat file1.xml | ./mapper.py | ./reducer.py
  • Verify that you get the same output as the output of the mapper in the mapper test section above.
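
On the cluster, Hadoop sorts the mapper output by key before the reduce step, so the reducer sees the tuples in title order rather than in file order. The sketch below imitates that locally with a function version of the identity reducer. The function name reduce_lines, the guard against lines without a tab, and the sample tuples are additions made for this example.

```python
def reduce_lines(lines):
    """Identity reducer: return each 'title<TAB>page' line unchanged."""
    out = []
    for line in lines:
        line = line.strip()
        if "\t" not in line:
            continue  # skip blank or malformed lines (an added safeguard)
        title, page = line.split("\t", 1)
        out.append("%s\t%s" % (title, page))
    return out

# Made-up mapper output for three files, in arbitrary order.
mapper_output = [
    "[[Zebra]]\t[[<text> Zeb...stripes.]]",
    "[[Apple]]\t[[<text> App...a fruit.]]",
    "[[Mango]]\t[[<text> Man...is sweet]]",
]

# sorted() stands in for the sort Hadoop performs between map and reduce.
for line in reduce_lines(sorted(mapper_output)):
    print(line)
```

The shell equivalent is to place sort between the two programs: cat file*.xml | ./mapper.py | sort | ./reducer.py.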

Running the Hadoop Streaming Program


The key to this whole lab is the way we start the Hadoop job. The information is available on the Apache Hadoop documentation/FAQ page. All we have to do is add the following switch to the hadoop jar command:

  -inputreader "StreamXmlRecordReader,begin=xml,end=/xml" 

It specifies that the input reader should be the StreamXmlRecordReader, and that it should feed the mapper program everything sandwiched between <xml> and </xml> as a single record.
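
One way to picture what the switch asks for is a scan that cuts the input into records, each running from a begin marker to the next end marker. The sketch below only illustrates that idea. It is not Hadoop's implementation, and the function name split_records, the use of full tags as markers, and the handling of unterminated records are assumptions.

```python
def split_records(data, begin, end):
    """Return every span of data that starts with begin and ends with end."""
    records = []
    position = 0
    while True:
        start = data.find(begin, position)
        if start == -1:
            break
        stop = data.find(end, start + len(begin))
        if stop == -1:
            break  # unterminated record: ignored in this sketch
        stop += len(end)
        records.append(data[start:stop])
        position = stop
    return records

# Made-up input holding two pages.
stream = "<xml><title>A</title></xml>\n<xml><title>B</title></xml>\n"
for record in split_records(stream, "<xml>", "</xml>"):
    print(record)
```

Each record found this way corresponds to one block of input that the mapper has to turn into a tuple.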

Running the job

  • Run your hadoop streaming mapper/reducer pair, and use the -inputreader switch:
   hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar  \
               -inputreader "StreamXmlRecordReader,begin=xml,end=/xml"  \
               -file ./mapper.py -mapper ./mapper.py \
               -file ./reducer.py -reducer ./reducer.py  \
               -input dft/xml -output dft-out
Remember to replace dft in the command above with your own initials or name!
  • Verify that your program progresses through your data files:
hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -inputreader \
             "StreamXmlRecordReader,begin=xml,end=/xml"  -file ./mapper.py -mapper ./mapper.py \
             -file ./reducer.py -reducer ./reducer.py  -input dft/xml -output dft-out
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop/hadoop-unjar3287253995728872117/] [] \ 
             /tmp/streamjob1960257454356309241.jar tmpDir=null
10/04/12 21:49:20 INFO mapred.FileInputFormat: Total input paths to process : 4
10/04/12 21:49:20 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop/mapred/local]
10/04/12 21:49:20 INFO streaming.StreamJob: Running job: job_201004011119_0108
10/04/12 21:49:20 INFO streaming.StreamJob: To kill this job, run:
10/04/12 21:49:20 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job  \
             -Dmapred.job.tracker=hadoop1:9001 -kill job_201004011119_0108
10/04/12 21:49:20 INFO streaming.StreamJob: Tracking URL: http://hadoop1:50030/jobdetails.jsp? \
10/04/12 21:49:21 INFO streaming.StreamJob:  map 0%  reduce 0%
10/04/12 21:49:25 INFO streaming.StreamJob:  map 75%  reduce 0%
10/04/12 21:49:26 INFO streaming.StreamJob:  map 100%  reduce 0%
10/04/12 21:49:35 INFO streaming.StreamJob:  map 100%  reduce 100%
10/04/12 21:49:36 INFO streaming.StreamJob: Job complete: job_201004011119_0108
10/04/12 21:49:36 INFO streaming.StreamJob: Output: dft-out

hadoop@hadoop6:~/352/dft/readXml$ hadoop dfs -ls dft-out/
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2010-04-12 21:49 /user/hadoop/dft-out/_logs
-rw-r--r--   2 hadoop supergroup        233 2010-04-12 21:49 /user/hadoop/dft-out/part-00000

hadoop@hadoop6:~/352/dft/readXml$ hadoop dfs -text dft-out/part-00000
  • Verify that you get the correct output.

Lab Experiment #1
The Mapper function, as presented here, assumes that the file it is given contains only one set of <xml>...</xml> tags. Modify it so that it can handle files containing several wiki pages, each contained in its own set of <xml>...</xml> tags.
Generate some XML files that contain several <xml>...</xml> blocks, and verify that your new, more general mapper works correctly.
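
One possible starting point for the experiment, and not the only solution, is to emit a tuple each time a closing </xml> tag is seen and then reset the state for the next page. The sketch below assumes, as mapper.py does, that every tag sits on its own line. The function name map_pages and the sample pages are hypothetical.

```python
def map_pages(lines):
    """Yield one '[[title]]<TAB>[[text]]' tuple per <xml>...</xml> block."""
    collected, title, in_text = [], "Unknown", False
    for line in lines:
        line = line.strip()
        if "<title>" in line:
            title = line[len("<title>"):-len("</title>")]
        if "<text>" in line:
            in_text = True
        if "</text>" in line:
            in_text = False
        if in_text:
            collected.append(line)
        if "</xml>" in line:
            # End of one page: emit its tuple, then reset for the next page.
            text = " ".join(collected)
            text = text[0:10] + "..." + text[-10:]
            yield "[[%s]]\t[[%s]]" % (title, text)
            collected, title, in_text = [], "Unknown", False

# Made-up input holding two pages in one stream.
pages = [
    "<xml>", "<title>One</title>", "<text>", "alpha beta gamma delta", "</text>", "</xml>",
    "<xml>", "<title>Two</title>", "<text>", "epsilon zeta eta theta", "</text>", "</xml>",
]
for pair in map_pages(pages):
    print(pair)
```

In an actual mapper.py, the pages list would be replaced by sys.stdin so that the function consumes whatever the streaming job feeds it.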