Wednesday, January 29, 2014

How to set up a single-node Hadoop cluster (YARN/MRv2)

This tutorial shows you how to set up and configure a single-node Hadoop cluster to run both YARN and MapReduce jobs, so that you can perform simple operations or debugging in HDFS. YARN (Yet Another Resource Negotiator) is the resource manager that takes care of allocating containers where jobs can run, using the data stored in HDFS. HDFS is a distributed file system.

The steps are straightforward:

Prerequisites:
  • GNU/Linux (I use CentOS 6.4)
  • JDK 1.7.x

Download Hadoop and Install:
Download a recent release from http://apache.mirror.nexicom.net/hadoop/common/stable/

Installing Hadoop is easy: just uncompress the tarball to a location of your choice (I use ~/opt/).
$ mv hadoop-2.2.0.tar.gz ~/opt/
$ cd ~/opt
$ tar -xzvf hadoop-2.2.0.tar.gz

To make the following steps easier, let's set up some environment variables:
$ vi ~/.bashrc
add
# Hadoop env variables
export HADOOP_PREFIX="/home/lxu/opt/hadoop-2.2.0/"
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
$ . ~/.bashrc

$ vi ~/.bash_profile
PATH=$JAVA_HOME/bin:$PATH:$HOME/bin:/home/lxu/opt/hadoop-2.2.0/bin:/home/lxu/opt/hadoop-2.2.0/sbin
export PATH
$ . ~/.bash_profile
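
To confirm the variables and PATH are picked up, a quick check in a fresh shell:
$ echo $HADOOP_PREFIX   # should print /home/lxu/opt/hadoop-2.2.0/
$ which hadoop          # should resolve to /home/lxu/opt/hadoop-2.2.0/bin/hadoop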

Hadoop mode:
Hadoop has 3 modes: Standalone (local) mode, Pseudo-distributed mode and Fully distributed mode.
  1. Standalone (or local) mode: No daemons run in this mode. If you run jps in your terminal, there will be no NameNode, ResourceManager or other Hadoop daemons. Hadoop simply uses the local filesystem as a substitute for HDFS.
  2. Pseudo-distributed mode: All daemons run on a single machine, mimicking the behaviour of a real cluster. The daemons run locally on your machine and communicate using the HDFS protocol.
  3. Fully distributed mode: This is the kind of environment you will usually find on test, QA and production grids: hundreds of machines, each with a similar number of cores, and the true power of Hadoop. As an application developer you would normally not set this up yourself; that is usually done by the admin folks.
"Fully distributed mode" is normally used for real job runs; "Standalone mode" and "Pseudo-distributed mode" are for testing and debugging.

Before we start:
$ vi ~/opt/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
Update JAVA_HOME to:
# The java implementation to use.
export JAVA_HOME="/usr/lib/jvm/jdk1.7.0"

Also set up passphraseless SSH.
Check that you can ssh to localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen
$ ssh-copy-id user@localhost
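
If ssh-copy-id is not available on your distribution, the manual equivalent is roughly:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
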
Now try the following command:
$ cd ~/opt/hadoop-2.2.0
$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

You are ready to configure and start your Hadoop cluster.

Standalone mode:
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ cd /tmp
$ mkdir input
$ cp ~/opt/hadoop-2.2.0/etc/hadoop/*.xml /tmp/input/
$ ~/opt/hadoop-2.2.0/bin/hadoop jar ~/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*
1 dfsadmin
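
The other bundled examples run the same way in standalone mode; for instance, a quick pi estimate (the two arguments are the number of maps and the number of samples per map):
$ ~/opt/hadoop-2.2.0/bin/hadoop jar ~/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10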

Pseudo-Distributed mode:
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

To run in Pseudo-Distributed mode, we need to configure HDFS and YARN.

Configure HDFS:

There are two files you need to configure: hdfs-site.xml and core-site.xml. Both files are located at $HADOOP_PREFIX/etc/hadoop/.

hdfs-site.xml is a Hadoop configuration XML file; it contains settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes. The hdfs-site.xml file ships with an empty configuration; change it to:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/lxu/opt/hadoop-2.2.0/hdfs/datanode</value>
        <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/lxu/opt/hadoop-2.2.0/hdfs/namenode</value>
        <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

You should change "/home/lxu" to your own path.
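
These directories should be created automatically when the namenode is formatted and the daemons start, but it does not hurt to create them up front:
$ mkdir -p ~/opt/hadoop-2.2.0/hdfs/namenode ~/opt/hadoop-2.2.0/hdfs/datanode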

core-site.xml is also a Hadoop configuration XML file. It contains settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce. core-site.xml also ships with an empty configuration; change it to:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost/</value>
        <description>NameNode URI</description>
    </property>
</configuration>
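
No port is given in the URI, so the namenode will listen on the default RPC port 8020. Once HDFS is running (started later in this tutorial), fs.defaultFS is what lets unqualified paths resolve to HDFS, so these two commands are equivalent:
$ hdfs dfs -ls /                       # resolved against fs.defaultFS
$ hdfs dfs -ls hdfs://localhost:8020/  # fully qualified equivalent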

Configure YARN:
YARN is also called MapReduce 2.0 (MRv2). The fundamental idea of MRv2 is to split the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

To configure YARN, we need to edit the yarn-site.xml file, located in $HADOOP_PREFIX/etc/hadoop/. It should be empty by default. Change yarn-site.xml to:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
        <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
        <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
        <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>2</value>
        <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
        <description>Physical memory, in MB, to be made available to running containers</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
        <description>Number of CPU cores that can be allocated for containers.</description>
    </property>
    <property>
    <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        <description>shuffle service that needs to be set for Map Reduce to run </description>
    </property>
</configuration>

Notice that the property "yarn.nodemanager.aux-services" is not necessary if you are not going to run any MapReduce jobs.
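
With these values, the NodeManager advertises 4096 MB of memory and 2 virtual cores to the ResourceManager, and individual container requests get rounded and capped into the 128-2048 MB and 1-2 vcore ranges defined above. Once the daemons are running (next section), a simple way to confirm the node registered with those resources is:
$ yarn node -list   # should list this machine as a single RUNNING node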

Start your single-node Hadoop cluster:
Now that we have finished configuring HDFS and YARN, we can start up the single-node cluster:

We need to format the namenode (first time only):
$ hdfs namenode -format

Start namenode and datanode:
$ hadoop-daemon.sh start namenode
starting namenode, logging to /home/lxu/opt/hadoop-2.2.0//logs/hadoop-lxu-namenode-tony.xu.com.out
$ hadoop-daemon.sh start datanode
starting datanode, logging to /home/lxu/opt/hadoop-2.2.0//logs/hadoop-lxu-datanode-tony.xu.com.out
Start YARN:
$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /home/lxu/opt/hadoop-2.2.0//logs/yarn-lxu-resourcemanager-tony.xu.com.out
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /home/lxu/opt/hadoop-2.2.0//logs/yarn-lxu-nodemanager-tony.xu.com.out
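
All four daemons should now be running. A quick sanity check, using only the standard Hadoop client commands:
$ jps                     # should list NameNode, DataNode, ResourceManager and NodeManager
$ hdfs dfsadmin -report   # should report one live datanode with non-zero configured capacity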

Test that the single-node Hadoop cluster works:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar --shell_command date --num_containers 2 --master_memory 1024

This example spawns a specified number of containers and runs a shell command in each of them. With the above command, we are telling Hadoop to run the Client class in hadoop-yarn-applications-distributedshell-2.2.0.jar, passing it the jar containing the definition of the ApplicationMaster (the same jar), the shell command to run in each container (date), the number of containers to spawn (2) and the memory used by the ApplicationMaster (1024 MB). The value of 1024 was found empirically, by running the program several times until it stopped failing because the ApplicationMaster used more memory than it had been allocated.
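
The output of the date command does not appear on your terminal; it ends up in the container logs on the local filesystem. Assuming the default NodeManager log location under $HADOOP_PREFIX/logs/userlogs (we did not change it), you can inspect it roughly like this:
$ ls $HADOOP_PREFIX/logs/userlogs/                                   # one directory per application
$ cat $HADOOP_PREFIX/logs/userlogs/application_*/container_*/stdout  # output of the shell command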

Configure MapReduce:
To run YARN-enabled MapReduce jobs, we need to configure MapReduce v2. It is easy: we just need to set some reasonable defaults for the memory and CPU requirements in mapred-site.xml, matching the YARN container limits we defined previously (note that the -Xmx heap sizes below stay under the 1024 MB container sizes, leaving headroom for JVM overhead):

Create mapred-site.xml from the provided template (in $HADOOP_PREFIX/etc/hadoop/):
$ cd $HADOOP_CONF_DIR
$ cp mapred-site.xml.template mapred-site.xml

Change its contents to:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx768m</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Execution framework.</description>
    </property>
    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each map task.</description>
    </property>
    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each reduce task.</description>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for maps.</description>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of maps.</description>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for reduces.</description>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of reduces.</description>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>localhost:8021</value>
        <description>JobTracker address (not used when the framework is yarn, kept for completeness).</description>
    </property>
</configuration>

Add "yarn.nodemanager.aux-services" in yarn-site.xml:
We need to set up a prerequisite for MapReduce in yarn-site.xml. Add
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run </description>
  </property>

if you don't already have it in yarn-site.xml (we added it earlier in this tutorial).

Then restart the resourcemanager and nodemanager:
$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh stop resourcemanager
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager

To test:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter out
14/01/29 10:10:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/29 10:10:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Running 10 maps.
.....
RECORDS_WRITTEN=1021881
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=10772846294
Job ended: Wed Jan 29 10:14:17 EST 2014
The job took 203 seconds.
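
The randomwriter output lands in HDFS, not on the local filesystem, and in this run it is roughly 10 GB. A few follow-up commands to inspect and then clean it up:
$ hdfs dfs -ls out        # one part file per map task
$ hdfs dfs -du -s out     # total size of the generated data, in bytes
$ hdfs dfs -rm -r out     # remove it when you are done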

Monitor jobs:
You can monitor job status in your web browser by pointing it to http://localhost:8088
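
If you prefer the command line, the yarn client can list applications as well (exact columns vary by Hadoop version); <application_id> below is whatever id the listing shows:
$ yarn application -list                      # applications and their states
$ yarn application -status <application_id>   # details for a single application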

To get a list of Hadoop daemons running, use jps:
$ jps
4471 NodeManager
4572 org.eclipse.equinox.launcher_1.3.0.v20130327-1440.jar
3664 DataNode
4072
4168 ResourceManager
3858 NameNode
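
When you are done, stop the daemons with the same helper scripts used to start them:
$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh stop resourcemanager
$ hadoop-daemon.sh stop datanode
$ hadoop-daemon.sh stop namenode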


