Thursday, January 23, 2014

How to configure Hadoop 2.x in Eclipse

This blog shows you how to configure Hadoop 2.x in your Eclipse IDE.

My Environment:
OS: CentOS release 6.4 (Final)
Java: JDK1.7.0_04
Eclipse: eclipse-jee-kepler-SR1-linux-gtk-x86_64.tar.gz
Hadoop: hadoop-2.2.0.tar.gz

Download and install Eclipse:
1. Download Eclipse from:
http://www.eclipse.org/downloads/

2. Untar the downloaded tar file:
# tar -xvf eclipse-jee-kepler-SR1-linux-gtk-x86_64.tar.gz

3. Install it in /opt
# mv eclipse /opt/eclipse-kp

4. Create a launcher on your Desktop (a sample .desktop file is sketched after this list)

5. Create a workspace:
$ mkdir -p /home/lxu/workspace/eclipse-kp

6. Launch Eclipse.
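For the launcher in step 4, a minimal .desktop file works on most Linux desktops. This is just a sketch: the file name is arbitrary, and the paths assume the /opt/eclipse-kp location from step 3 (icon.xpm ships in the Eclipse directory):

$ cat > ~/Desktop/eclipse-kp.desktop << 'EOF'
[Desktop Entry]
Type=Application
Name=Eclipse Kepler
Exec=/opt/eclipse-kp/eclipse
Icon=/opt/eclipse-kp/icon.xpm
Terminal=false
EOF
$ chmod +x ~/Desktop/eclipse-kp.desktop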


Setting Up Eclipse:

  1. First, we need to set a couple of classpath variables so Eclipse can find the dependencies.
  2. Go to Window -> Preferences.
  3. Go to Java -> Build Path -> Classpath Variables.
  4. Add a new entry with name ANT_PATH and path set to the Ant home on your machine, typically /usr/share/ant (or wherever your Ant is installed).
  5. Add another new entry with name M2_REPO and path set to your maven repository, typically $HOME/.m2/repository (e.g. /home/user/.m2/repository).
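If you are not sure where Ant or your local Maven repository lives, a quick shell check helps; the paths below are just the common defaults:

$ ls -d /usr/share/ant
$ ls -d $HOME/.m2/repository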



Hadoop requires tools.jar, which lives under JDK_HOME/lib. Eclipse may not pick it up automatically, so add it by hand:

  1. Go to Window->Preferences->Java->Installed JREs.
  2. Select the right Java version from the list, and click “Edit”.
  3. In the pop-up, click "Add External JARs...", navigate to JDK_HOME/lib, and add tools.jar.


Hadoop uses a particular formatting style. When contributing to the project, you are required to follow the style guidelines: Java code indented with spaces only, with both indentation and tab width set to 2 spaces. To set this up:

  1. Download the Formatter (https://raw.github.com/cloudera/blog-eclipse/master/hadoop-format.xml)
  2. Go to Window -> Preferences.
  3. Go to Java->Code Style -> Formatter.
  4. Import this Formatter.
  5. It is good practice to enable automatic formatting of modified code when you save a file. Go to Window -> Preferences -> Java -> Editor -> Save Actions and select "Perform the selected actions on save", "Format source code", and "Format edited lines". Also, de-select "Organize imports".
  6. Install the m2e plugin. Go to Help -> Install New Software, enter "http://download.eclipse.org/technology/m2e/releases" into the "Work with" box, then select the m2e plugins and install them.

Configure Hadoop 2.x
1. Download the Hadoop sources using svn/git and check out the appropriate branch, or download a release source tarball (http://apache.sunsite.ualberta.ca/hadoop/common/stable/), then extract:
# tar -xzvf hadoop-2.2.0.tar.gz
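If you take the svn/git route instead of the tarball, something along these lines should work; the mirror URL and branch name here are assumptions, so pick the branch matching your target version:

$ git clone https://github.com/apache/hadoop.git
$ cd hadoop
$ git checkout branch-2.2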

2. Create a new Java project in Eclipse. Go to "File" -> "New" -> "Project" -> "Java Project".

3. Give the project a name; I use "Hadoop". Make sure you use the correct JRE, then click "Finish".


4. Import the Hadoop JAR files. Since we are going to write a MapReduce job, we need the Hadoop client JARs on the build path; a listing of where they live in the tarball follows the next step.

5. Right-click the project, choose "Build Path" -> "Configure Build Path", switch to the "Libraries" tab, and click "Add External JARs...". After importing the external JARs, they appear under "Referenced Libraries" in the Package Explorer.
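With the binary tarball of Hadoop 2.2.0, the JARs a basic MapReduce job needs sit under share/hadoop inside the extracted directory; roughly the following (exact file names vary by version):

$ ls hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
$ ls hadoop-2.2.0/share/hadoop/common/lib/*.jar
$ ls hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar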



Write a MapReduce Job
1. Create a package under src; I use "org.myorg" to match the source below.


2. Create a "WordCount" Java class in that package.


3. Use the following source code:
package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
  // Mapper: tokenizes each input line and emits (word, 1) for every token.
  public static class Map extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: configures and submits the job using the classic
  // org.apache.hadoop.mapred API, which Hadoop 2.x still supports.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

4. Generate the WordCount.jar file. In Eclipse, use File -> Export -> Java -> JAR file; a command-line alternative is sketched below.
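If you prefer the command line, the classes Eclipse compiled (by default into the project's bin directory) can be packaged directly; the workspace path below matches the one created earlier and may differ on your machine:

$ cd /home/lxu/workspace/eclipse-kp/Hadoop
$ jar -cvf WordCount.jar -C bin .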



5. You should now have the WordCount.jar file in your project directory.

6. Upload your WordCount.jar file to one of your Hadoop client servers and run it.

7. On your Hadoop client server:
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hadoop fs -put file* /user/tony/wordcount/input
$ hadoop jar WordCount.jar org.myorg.WordCount /user/tony/wordcount/input /user/tony/wordcount/output

14/01/23 11:14:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/23 11:14:21 INFO mapred.FileInputFormat: Total input paths to process : 2
14/01/23 11:14:21 INFO mapred.JobClient: Running job: job_201312181623_0523
14/01/23 11:14:22 INFO mapred.JobClient:  map 0% reduce 0%
14/01/23 11:14:30 INFO mapred.JobClient:  map 100% reduce 0%
14/01/23 11:14:36 INFO mapred.JobClient:  map 100% reduce 100%
14/01/23 11:14:38 INFO mapred.JobClient: Job complete: job_201312181623_0523
14/01/23 11:14:38 INFO mapred.JobClient: Counters: 33
14/01/23 11:14:38 INFO mapred.JobClient:   File System Counters
14/01/23 11:14:38 INFO mapred.JobClient:     FILE: Number of bytes read=313
14/01/23 11:14:38 INFO mapred.JobClient:     FILE: Number of bytes written=2538349
14/01/23 11:14:38 INFO mapred.JobClient:     FILE: Number of read operations=0
14/01/23 11:14:38 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/01/23 11:14:38 INFO mapred.JobClient:     FILE: Number of write operations=0
14/01/23 11:14:38 INFO mapred.JobClient:     HDFS: Number of bytes read=365
14/01/23 11:14:38 INFO mapred.JobClient:     HDFS: Number of bytes written=41
14/01/23 11:14:38 INFO mapred.JobClient:     HDFS: Number of read operations=18
14/01/23 11:14:38 INFO mapred.JobClient:     HDFS: Number of large read operations=0
14/01/23 11:14:38 INFO mapred.JobClient:     HDFS: Number of write operations=24
14/01/23 11:14:38 INFO mapred.JobClient:   Job Counters
14/01/23 11:14:38 INFO mapred.JobClient:     Launched map tasks=3
14/01/23 11:14:38 INFO mapred.JobClient:     Launched reduce tasks=12
14/01/23 11:14:38 INFO mapred.JobClient:     Data-local map tasks=3
14/01/23 11:14:38 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=11392
14/01/23 11:14:38 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=46430
14/01/23 11:14:38 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/23 11:14:38 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/23 11:14:38 INFO mapred.JobClient:   Map-Reduce Framework
14/01/23 11:14:38 INFO mapred.JobClient:     Map input records=2
14/01/23 11:14:38 INFO mapred.JobClient:     Map output records=8
14/01/23 11:14:38 INFO mapred.JobClient:     Map output bytes=82
14/01/23 11:14:38 INFO mapred.JobClient:     Input split bytes=312
14/01/23 11:14:38 INFO mapred.JobClient:     Combine input records=8
14/01/23 11:14:38 INFO mapred.JobClient:     Combine output records=6
14/01/23 11:14:38 INFO mapred.JobClient:     Reduce input groups=5
14/01/23 11:14:38 INFO mapred.JobClient:     Reduce shuffle bytes=649
14/01/23 11:14:38 INFO mapred.JobClient:     Reduce input records=6
14/01/23 11:14:38 INFO mapred.JobClient:     Reduce output records=5
14/01/23 11:14:38 INFO mapred.JobClient:     Spilled Records=12
14/01/23 11:14:38 INFO mapred.JobClient:     CPU time spent (ms)=11630
14/01/23 11:14:38 INFO mapred.JobClient:     Physical memory (bytes) snapshot=4309741568
14/01/23 11:14:38 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25619640320
14/01/23 11:14:38 INFO mapred.JobClient:     Total committed heap usage (bytes)=7556431872
14/01/23 11:14:38 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/01/23 11:14:38 INFO mapred.JobClient:     BYTES_READ=50
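
8. Verify the output. The counts below follow from the two input files; since this run launched multiple reduce tasks, the words are spread across several part files, so cat them all (the default TextOutputFormat writes tab-separated key/value pairs):

$ hadoop fs -cat /user/tony/wordcount/output/part-*
Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2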
