The steps to set up a single-node Hadoop cluster are straightforward:
Prerequisites:
- GNU/Linux (I use CentOS 6.4)
- JDK 1.7.x
Download Hadoop and Install:
Download a recent release from http://apache.mirror.nexicom.net/hadoop/common/stable/
Installing Hadoop is easy: uncompress the tarball to a location of your choice (I use ~/opt/).
$ mv hadoop-2.2.0.tar.gz ~/opt/
$ cd ~/opt
$ tar -xzvf hadoop-2.2.0.tar.gz
To make the following steps easier, let's set up some environment variables:
$ vi ~/.bashrc

and add:

# Hadoop env variables
export HADOOP_PREFIX="/home/lxu/opt/hadoop-2.2.0/"
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX

$ . ~/.bashrc
$ vi ~/.bash_profile

and add the Hadoop bin and sbin directories to PATH:

PATH=$JAVA_HOME/bin:$PATH:$HOME/bin:$M2:/opt/abook-0.5.6/bin:/opt/vuze:/home/lxu/opt/hadoop-2.2.0/bin:/home/lxu/opt/hadoop-2.2.0/sbin:
export PATH

$ . ~/.bash_profile
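To verify that the variables took effect (a quick sanity check; the paths assume the example locations above):

$ echo $HADOOP_HOME
$ which hadoop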
Hadoop mode:
Hadoop has 3 modes: Standalone (local) mode, Pseudo-distributed mode and Fully distributed mode.
- Standalone (or local) mode: There are no daemons running in this mode. When you run jps in your terminal, there will be no JobTracker, NameNode or other daemons listed. Hadoop simply uses its local file system as a substitute for HDFS.
- Pseudo-distributed mode: All daemons run on a single machine, mimicking the behaviour of a cluster. All the daemons run locally on your machine using the HDFS protocol.
- Fully distributed mode: This is the kind of environment you will usually find on test, QA and production grids. It can span hundreds of machines, each with some number of cores, and this is where the true power of Hadoop shows. As an application developer you would not usually set this up; it's usually the admin folks who do.
"Fully distributed mode" is normally used for running final jobs; "Standalone mode" and "Pseudo-distributed mode" are for testing and debugging.
Before we start:
# vi /opt/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
Update JAVA_HOME to:
# The java implementation to use.
export JAVA_HOME="/usr/lib/jvm/jdk1.7.0"
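Before going further, it's worth confirming that this path really contains a JDK (a quick check; adjust the path to match your own Java install):

$ /usr/lib/jvm/jdk1.7.0/bin/java -version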
Also set up passphraseless SSH.
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen
$ ssh-copy-id user@localhost
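If your system doesn't ship ssh-copy-id, appending the public key by hand is equivalent (a standard alternative, assuming the default key file ~/.ssh/id_rsa.pub):

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys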
Now try the following command:
# cd /opt/hadoop-2.2.0
# bin/hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
You are ready to configure and start your Hadoop cluster.
Standalone mode:
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ cd /tmp
$ mkdir input
$ cp /opt/hadoop-2.2.0/etc/hadoop/*.xml /tmp/input/
$ /opt/hadoop-2.2.0/bin/hadoop jar ~/opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*
1       dfsadmin
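If you want to re-run the example, remove the output directory first; MapReduce refuses to start a job whose output directory already exists:

$ rm -rf /tmp/output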
Pseudo-Distributed mode:
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
To run in pseudo-distributed mode, we need to configure HDFS and YARN.
Configure HDFS:
There are two files you need to configure: hdfs-site.xml and core-site.xml. Both files are located in $HADOOP_PREFIX/etc/hadoop/.
hdfs-site.xml is a Hadoop configuration XML file; it contains configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes. The hdfs-site.xml file should be empty by default; change it to:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/lxu/opt/hadoop-2.2.0/hdfs/datanode</value>
    <description>Comma separated list of paths on the local filesystem of a
    DataNode where it should store its blocks.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/lxu/opt/hadoop-2.2.0/hdfs/namenode</value>
    <description>Path on the local filesystem where the NameNode stores the
    namespace and transaction logs persistently.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified in create time.</description>
  </property>
</configuration>
You should change "/home/lxu" to your own path.
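It's also safest to create the namenode and datanode directories up front so the daemons don't trip over missing paths (a small sketch, assuming the example paths above):

$ mkdir -p ~/opt/hadoop-2.2.0/hdfs/namenode
$ mkdir -p ~/opt/hadoop-2.2.0/hdfs/datanode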
core-site.xml is also a Hadoop configuration XML file. It contains configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce. The core-site.xml file should be empty by default; change it to:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0. See accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
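Once the daemons are running you can double-check that Hadoop picks up this value; hdfs getconf prints the effective configuration, so the following should print hdfs://localhost/ (an optional verification):

$ hdfs getconf -confKey fs.defaultFS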
Configure YARN:
YARN is also called MapReduce 2.0 (MRv2). The fundamental idea of MRv2 is to split up the two major responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
To configure YARN, we need to make changes to the yarn-site.xml file. This file is located in $HADOOP_PREFIX/etc/hadoop/ and should be empty by default. Change yarn-site.xml to:
<?xml version="1.0"?>
<!-- Licensed under the Apache License, Version 2.0. See accompanying LICENSE file. -->
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container
    request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container
    request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM,
    in terms of virtual CPU cores. Requests lower than this won't take effect,
    and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM,
    in terms of virtual CPU cores. Requests higher than this won't take effect,
    and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Physical memory, in MB, to be made available to running
    containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for
    containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Shuffle service that needs to be set for Map Reduce
    to run.</description>
  </property>
</configuration>
Note that the property "yarn.nodemanager.aux-services" is not necessary if you are not going to run any MapReduce jobs.
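As a rough sanity check on these numbers (my reading of the settings above): with 4096 MB of NodeManager memory and a 128 MB minimum allocation, memory alone would allow up to 4096 / 128 = 32 minimal containers, but the 2 available vcores, at a minimum of 1 vcore per container, cap concurrency at 2 containers; the 2048 MB maximum allocation simply bounds how large any single container request can get.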
Start your Hadoop single node:
Now that we have finished configuring HDFS and YARN, we can start up the single-node cluster.
First, format the namenode (this is needed the first time only):
$ hdfs namenode -format
Start namenode and datanode:

$ hadoop-daemon.sh start namenode
starting namenode, logging to /home/lxu/opt/hadoop-2.2.0//logs/hadoop-lxu-namenode-tony.xu.com.out
$ hadoop-daemon.sh start datanode
starting datanode, logging to /home/lxu/opt/hadoop-2.2.0//logs/hadoop-lxu-datanode-tony.xu.com.out

Start YARN:

$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /home/lxu/opt/hadoop-2.2.0//logs/yarn-lxu-resourcemanager-tony.xu.com.out
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /home/lxu/opt/hadoop-2.2.0//logs/yarn-lxu-nodemanager-tony.xu.com.out
Test that the single-node Hadoop cluster works:
$HADOOP_PREFIX/bin/hadoop jar \
  $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
  org.apache.hadoop.yarn.applications.distributedshell.Client \
  --jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
  --shell_command date \
  --num_containers 2 \
  --master_memory 1024
This example spawns a specified number of containers and runs a shell command in each of them. With the above command, we are telling Hadoop to run the Client class in hadoop-yarn-applications-distributedshell-2.2.0.jar, passing it the jar containing the definition of the ApplicationMaster (the same jar), the shell command to run in each of the hosts (date), the number of containers to spawn (2) and the memory used by the ApplicationMaster (1024 MB). The value of 1024 was determined empirically, by re-running the program until it stopped failing because the ApplicationMaster used more memory than had been allocated to it.
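You can also ask the ResourceManager about the application from the command line (an optional check; the output columns vary between versions):

$ yarn application -list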
Configure MapReduce:
To run YARN-enabled MapReduce jobs, we need to configure MapReduce v2. This is easy: we just need to set up some reasonable defaults for the memory and CPU requirements in mapred-site.xml to match those we defined for the YARN containers above.
Create a mapred-site.xml from template:
$ cp mapred-site.xml.template mapred-site.xml
Change its contents to:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0. See accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx768m</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>Execution framework.</description>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each map task.</description>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each reduce task.</description>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
    <description>Larger resource limit for maps.</description>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx768m</value>
    <description>Heap-size for child jvms of maps.</description>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
    <description>Larger resource limit for reduces.</description>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx768m</value>
    <description>Heap-size for child jvms of reduces.</description>
  </property>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>jobtracker.alexjf.net:8021</value>
  </property>
</configuration>
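Note the pattern in the memory values above: each JVM heap (-Xmx768m) is deliberately smaller than its container allocation (1024 MB). The roughly 25% gap is, as I understand it, headroom for JVM overhead beyond the heap, such as thread stacks and native memory; a heap equal to the container size tends to get the task killed for exceeding its memory limit.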
We need to set up one prerequisite for MapReduce in yarn-site.xml. If you don't already have the following property there, add it:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Shuffle service that needs to be set for Map Reduce to run.</description>
</property>
Then restart the resourcemanager and nodemanager:

$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh stop resourcemanager
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager
To test:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter out
14/01/29 10:10:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/29 10:10:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Running 10 maps.
.....
        RECORDS_WRITTEN=1021881
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=10772846294
Job ended: Wed Jan 29 10:14:17 EST 2014
The job took 203 seconds.
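You can also confirm that the job wrote its output to HDFS (a quick check, using the out directory from the command above):

$ hdfs dfs -ls out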
Monitor jobs:
You can monitor job status in your web browser by pointing it at http://localhost:8088
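The NameNode serves its own web UI as well, at http://localhost:50070 by default in Hadoop 2.2, which is handy for browsing HDFS and checking datanode status.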
To get a list of Hadoop daemons running, use jps:
$ jps
4471 NodeManager
4572 org.eclipse.equinox.launcher_1.3.0.v20130327-1440.jar
3664 DataNode
4072
4168 ResourceManager
3858 NameNode