CSC352 Hadoop Howto & FAQ
This page contains recipes and answers to questions relating to using Hadoop and MapReduce.
Contents
HOWTO/FAQ
Howto: control the number of Reduce tasks
--Thiebaut 14:06, 1 April 2010 (UTC)
When using streaming and a python program, use the -jobconf switch as shown here:
hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar \
    -jobconf mapred.reduce.tasks=16 \
    -file ./mapper.py \
    -mapper ./mapper.py \
    -file ./reducer.py \
    -reducer ./reducer.py \
    -input dft -output dft-output
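For reference, here is a minimal sketch of what a word-count mapper.py/reducer.py pair might look like in streaming mode (a hypothetical example; your actual scripts may differ). Each streaming script reads lines from stdin and writes tab-separated key/value lines to stdout:

```python
#!/usr/bin/env python
# Sketch of a streaming word-count pair: hypothetical contents of
# mapper.py and reducer.py, combined here for illustration.
import sys

def map_line(line):
    # mapper.py logic: emit (word, 1) for each whitespace-separated token
    return [(word, 1) for word in line.split()]

def reduce_pairs(pairs):
    # reducer.py logic: pairs arrive sorted by key; sum counts per word
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return sorted(totals.items())

if __name__ == "__main__":
    # In streaming mode the mapper reads raw input lines from stdin
    # and writes "word<TAB>count" lines to stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            sys.stdout.write("%s\t%d\n" % (word, count))
```

Hadoop sorts the mapper output by key before the reducer sees it, which is why the reducer can simply accumulate counts per word.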
Howto: use XML input with Java
--Yli2 21:17, 23 April 2010 (UTC)
In the default text input mode, Hadoop feeds each map call a single line of input. To split the input on XML tags instead, we can use a custom InputFormat class.
- Make a copy of XmlInputFormat.java (from Mahout) in your working directory.
- Modify the package header (Line 18). For example,
package org.hadoop.hadoop;
- Import XmlInputFormat to your MapReduce class. For example, add the following line:
import org.hadoop.hadoop.XmlInputFormat;
- Add the following lines after conf.setReducerClass(Reduce.class); inside run(). Note that <xml> and </xml> can be replaced by any XML tags.
conf.set("xmlinput.start", "<xml>");
conf.set("xmlinput.end", "</xml>");
conf.setInputFormat(XmlInputFormat.class);
- Compile XmlInputFormat.java with your other Java classes, then build the jar:
javac -classpath /home/hadoop/hadoop/hadoop-0.19.2-core.jar -d MyMapReduce_classes MyMapReduce.java XmlInputFormat.java
jar -cvf MyMapReduce.jar -C MyMapReduce_classes/ .
- Run Hadoop with the jar file as usual.
Warning: This simple approach may result in a java.lang.OutOfMemoryError if an input record is too large. In that case it is better to use Streaming with XmlRecordReader.
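If you do take the streaming route, the record reader hands each mapper invocation one complete record between the configured start and end tags, and a Python mapper can parse it with the standard library. A sketch, assuming small, well-formed records such as <xml><title>...</title></xml> (the element names here are hypothetical):

```python
#!/usr/bin/env python
# Sketch of a streaming mapper for XML records. Assumes each input
# record is one complete, well-formed <xml>...</xml> fragment.
import sys
import xml.etree.ElementTree as ET

def map_record(record):
    # Parse one XML record and emit (tag, text) pairs for its
    # immediate child elements.
    root = ET.fromstring(record)
    return [(child.tag, (child.text or "").strip()) for child in root]

if __name__ == "__main__":
    for record in sys.stdin:
        record = record.strip()
        if not record:
            continue
        for key, value in map_record(record):
            sys.stdout.write("%s\t%s\n" % (key, value))
```

Because the record is parsed one at a time and discarded, memory use stays proportional to a single record rather than the whole input split.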