CSC352 Hadoop Howto & FAQ

From CSclasswiki

This page contains recipes and answers to frequently asked questions about using Hadoop and MapReduce.

HOWTO/FAQ

Howto: control the number of Reduce tasks

--Thiebaut 14:06, 1 April 2010 (UTC)

When using Hadoop streaming with a Python mapper and reducer, set the number of reduce tasks with the -jobconf switch, as shown here:

hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar \
       -jobconf mapred.reduce.tasks=16 \
       -file ./mapper.py \
       -mapper ./mapper.py \
       -file ./reducer.py \
       -reducer ./reducer.py \
       -input dft -output dft-output
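The mapper.py and reducer.py passed to streaming are ordinary scripts that read lines from stdin and write tab-separated key/value pairs to stdout; Hadoop sorts the mapper's output by key before feeding it to the reducer. As a sketch of that contract (the word-count logic here is illustrative, not the course's actual programs):

```python
#!/usr/bin/env python
# Illustrative streaming pair: the mapper emits "word<TAB>1" per token,
# and the reducer sums the counts for each key.
# (Sketch only; the real mapper.py/reducer.py may differ.)
import sys

def map_line(line):
    # Emit one (word, 1) pair per whitespace-separated token.
    for word in line.split():
        yield word, 1

def reduce_pairs(pairs):
    # Sum counts per word; pairs arrive grouped by key, as the
    # streaming framework's sort phase guarantees.
    counts = {}
    order = []
    for word, n in pairs:
        if word not in counts:
            counts[word] = 0
            order.append(word)
        counts[word] += n
    return [(w, counts[w]) for w in order]

if __name__ == "__main__":
    # In mapper.py this loop emits pairs; reducer.py would instead
    # read them back from stdin and call reduce_pairs.
    for line in sys.stdin:
        for word, n in map_line(line):
            sys.stdout.write("%s\t%d\n" % (word, n))
```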

Howto: use XML input with Java

--Yli2 21:17, 23 April 2010 (UTC)

In the default text input mode, Hadoop feeds only one line to each map function call. To split the input into records based on XML tags instead, we can use a customized InputFormat class.

  • Import XmlInputFormat into your MapReduce class; it lives in the package org.hadoop.hadoop, so add the following line:
import org.hadoop.hadoop.XmlInputFormat;
  • Add the following lines after conf.setReducerClass(Reduce.class); inside run(). Note that <xml> and </xml> can be replaced by any pair of XML tags delimiting your records.
conf.set("xmlinput.start","<xml>");
conf.set("xmlinput.end","</xml>");
conf.setInputFormat(XmlInputFormat.class);
  • Compile XmlInputFormat.java together with your other Java classes:
javac -classpath /home/hadoop/hadoop/hadoop-0.19.2-core.jar -d MyMapReduce_classes MyMapReduce.java XmlInputFormat.java
jar -cvf MyMapReduce.jar -C MyMapReduce_classes/ .
  • Run Hadoop with the jar file as usual.

Warning: This simple approach may result in a java.lang.OutOfMemoryError if an input record is too large. In that case it is better to use streaming with XmlRecordReader.
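With the streaming approach, the record reader hands each complete <xml>...</xml> block to the mapper as a single record, which a Python mapper can then parse with the standard library. A minimal sketch, assuming each record arrives on one line and contains a <title> child element (both assumptions, not part of the recipe above):

```python
#!/usr/bin/env python
# Sketch of a streaming mapper for XML records.
# Assumes each stdin record is one complete <xml>...</xml> block;
# the <title> child element is an assumption for illustration.
import sys
import xml.etree.ElementTree as ET

def parse_record(record):
    """Return the text of the <title> child of an XML record,
    or None if the record is malformed or has no <title>."""
    try:
        root = ET.fromstring(record)
    except ET.ParseError:
        return None
    title = root.find("title")
    return title.text if title is not None else None

if __name__ == "__main__":
    for record in sys.stdin:
        title = parse_record(record)
        if title is not None:
            sys.stdout.write("%s\t1\n" % title)
```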