Saturday, February 22, 2014

Apache Hadoop - How to Install a Three Nodes Cluster (1)

This tutorial shows you how to install a three-node Apache Hadoop cluster in VMware. No pre-built Hadoop appliance is used in this tutorial; we will be using Apache Hadoop 2.2.0 directly.

Since this is a long tutorial, it is split into several parts:
Apache Hadoop - How to Install a Three Nodes Cluster (2)
Apache Hadoop - How to Install a Three Nodes Cluster (3)
Apache Hadoop - How to Install a Three Nodes Cluster (4)

Hadoop version:
hadoop-2.2.0

VMware version:
VMware Workstation 10.0.1 build-1379776

Operating system:
The first thing you need to do is decide which OS you will use for Hadoop. While most of Hadoop is written in Java, Linux is the only production-quality option right now. Hortonworks does provide a Windows version of Hadoop, Hortonworks Data Platform 1.3 for Windows, which runs on Windows Server 2008 R2 and 2012 (http://hortonworks.com/blog/getting-started-with-hadoop-on-windows-with-hdp-1-3/), and there is also win-hadoop (http://sourceforge.net/p/win-hadoop/wiki/Hadoop-on-Cygwin/), which requires Cygwin, but I still strongly recommend choosing Linux as your Hadoop cluster OS. Why? As the Apache Hadoop documentation puts it:

GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.

Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Even though Windows has much better remote management support than most people realize, it's still tough to beat Linux when it comes to the ease (and price tag) of setting up a large compute farm. This is just a guess, but perhaps researchers who need to build such massive clusters are less likely to want to put much of their budget toward OS licensing.

Your choice of OS may be influenced by your corporate platform, the administration tools you use, and your hardware support, but the best choice is the Linux distribution you are most familiar with. It also depends on the Hadoop distribution (Cloudera CDH, Hortonworks, MapR, or plain Apache Hadoop) you choose. I use CentOS 6.5 (64-bit).

Preparation:
In VMware, create three CentOS virtual machines. I use “CentOS-6.5-x86_64-LiveCD.iso”. Set hostnames to "hadoop1.com", "hadoop2.com" and "hadoop3.com".
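If you did not set the hostnames during installation, you can change them afterwards. A minimal sketch for hadoop1.com on CentOS 6 (repeat with the other names on the other two VMs):
# hostname hadoop1.com
# sed -i 's/^HOSTNAME=.*/HOSTNAME=hadoop1.com/' /etc/sysconfig/network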

Update your OS:
Get the latest versions of packages from the repositories by running “yum update”.

Disable SELinux:
To avoid unnecessary installation problems, let's disable SELinux first; it is on by default:
# getenforce
Enforcing
# setenforce 0
# getenforce
Permissive
# vi /etc/selinux/config
set
SELINUX=disabled
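If you prefer a one-liner to editing the file by hand, a sed command like the following should do the same thing (assuming the file still has the default SELINUX=enforcing line):
# sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# grep ^SELINUX= /etc/selinux/config
SELINUX=disabled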

Create Hadoop users and groups:
Hadoop daemons should not run as the root user, and the HDFS and YARN daemons should run as different Unix users. We are going to create three users (hdfs, yarn and mapred) and one group called hadoop.

On all three servers:
# useradd hdfs
# useradd yarn
# useradd mapred
# groupadd hadoop
# usermod -a -G hadoop hdfs
# usermod -a -G hadoop yarn
# usermod -a -G hadoop mapred
# id hdfs
uid=501(hdfs) gid=501(hdfs) groups=501(hdfs),504(hadoop)
# id yarn
uid=502(yarn) gid=502(yarn) groups=502(yarn),504(hadoop)
# id mapred
uid=503(mapred) gid=503(mapred) groups=503(mapred),504(hadoop)



Software requirements:
Hadoop requires a few external software packages:
  • Java Development Kit (JDK)
  • Network Time Protocol (NTP)
  • Secure Shell (ssh)
  • Mail Transfer Agent (MTA, such as sendmail or qmail)
  • Domain Name service (DNS)
Install JDK:
To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.
$ ll /usr/java/
total 4
lrwxrwxrwx 1 root root   16 May  1  2013 default -> /usr/java/latest
lrwxrwxrwx 1 root root   21 May  2  2013 latest -> /usr/lib/jvm/jdk1.7.0

Download JDK 64-bit from http://www.oracle.com/technetwork/java/javase/downloads/index.html
Then:
# tar -xzf jdk-7u51-linux-x64.tar.gz
# mkdir /usr/lib/jvm
# mv ./jdk1.7.0_51 /usr/lib/jvm/jdk1.7.0
# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1
# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1
# update-alternatives --config java
# update-alternatives --config javac
# update-alternatives --config javaws
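To end up with the /usr/java/default symlink chain shown above, create the links manually. This is just a sketch, assuming the JDK was moved to /usr/lib/jvm/jdk1.7.0 as in the previous step:
# mkdir -p /usr/java
# ln -s /usr/lib/jvm/jdk1.7.0 /usr/java/latest
# ln -s /usr/java/latest /usr/java/default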

Install NTP:
# yum install ntp
# chkconfig ntpd on
# chkconfig --list | grep ntpd
# service ntpd start
# ps -ef | grep ntp

If you have your own time server, update the /etc/ntp.conf file to use your own server, otherwise you can keep the public servers.
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org
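Once ntpd has been running for a few minutes, you can check that it is actually synchronizing against these servers (an optional sanity check):
# ntpq -p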

Install SSH:
SSH should come with CentOS 6.x. Make sure your firewall is not blocking communication between the Hadoop nodes, and prepare an administrator account for the installation.
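On CentOS 6 the firewall is iptables. For a lab cluster like this one, the simplest approach is to check it and turn it off (in production you would open the required Hadoop ports instead):
# service iptables status
# service iptables stop
# chkconfig iptables off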

Install Mail Transfer Agent:
It is typical to install “mailx”, a facility for sending and receiving mail on a Linux system. Sendmail is a more full-featured mail server that handles both incoming and outgoing mail requests; however, sendmail is complicated to configure. If you are not looking to receive mail and only want to send mail out, use “mailx” (nail).
# yum install mailx
Now send a test email to check that it works properly.
# echo "Test Email" | mail -s "This is a test email." externalemail@domain.com
You can check whether anything is stuck in the outgoing mail queue with:
$ mailq
Mail queue is empty

Install DNS:
For small Hadoop clusters, it doesn't matter whether you use a DNS server or the /etc/hosts file for servers to find each other, but for large Hadoop clusters it is better to have your own DNS server; it saves you from updating the hosts file on every node. The most important DNS requirement for Hadoop is that forward and reverse lookups match exactly. To set up a DNS server, follow these guides:
http://tonylixu.blogspot.ca/2013/09/how-to-set-up-two-in-one-dns-server-or.html
http://tonylixu.blogspot.ca/2013/09/how-to-set-up-two-in-one-dns-server-or_18.html

To test:
http://tonylixu.blogspot.ca/2013/12/hadoop-host-identification-process.html
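If you go with the /etc/hosts approach for this small cluster, every node needs the same entries. A minimal sketch, with placeholder IP addresses that you should replace with your VMs' actual ones:
# vi /etc/hosts
add
---------------------------------
192.168.1.101   hadoop1.com   hadoop1
192.168.1.102   hadoop2.com   hadoop2
192.168.1.103   hadoop3.com   hadoop3
---------------------------------
Whichever approach you use, verify that forward and reverse lookups agree, for example:
# getent hosts hadoop1.com
# getent hosts 192.168.1.101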

Kernel Tuning:

Change limits:
Cloudera recommends increasing the number of file handles to more than 10,000.
# vi /etc/security/limits.conf
add
---------------------------------
hdfs       -    nofile    1048576
hdfs       -    nproc     32000
mapred     -    nofile    1048576
mapred     -    nproc     32000
hbase      -    nofile    1048576
hbase      -    nproc     32000
hive       -    nofile    1048576
---------------------------------
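After the affected users log in again, you can confirm the new limits are in effect (ulimit -n shows open files, -u shows max user processes):
# su - hdfs -c 'ulimit -n -u'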

Set swappiness:
Swapping Hadoop daemon data to disk can cause operations to time out and potentially fail if the disk is performing other I/O operations. This is especially dangerous for HBase, as region servers must maintain communication with ZooKeeper lest they be marked as failed. To avoid this, vm.swappiness should be set to 0 (zero) to instruct the kernel to avoid swapping application data whenever it has the option. Most Linux distributions ship with vm.swappiness set to 60 or even as high as 80.

# echo 0 > /proc/sys/vm/swappiness
# cat /proc/sys/vm/swappiness

Set overcommit_memory:
So why does this matter to Hadoop? Hadoop Streaming, a library that allows MapReduce jobs to be written in any language that can read from standard input and write to standard output, works by forking the user's code as a child process and piping data through it. This means that not only do we need to account for the memory the Java child task uses, but also that when it forks, for a moment in time before it execs, it uses twice the amount of memory we'd expect it to. For this reason, it is sometimes necessary to set vm.overcommit_memory to the value 1 (one) and adjust vm.overcommit_ratio accordingly.

# echo 1 >  /proc/sys/vm/overcommit_memory
# cat /proc/sys/vm/overcommit_memory
# vi /etc/sysctl.conf
add the following (dashes not included)
---------------------------------
# Hadoop Kernel tuning
vm.swappiness=0
vm.overcommit_memory=1
---------------------------------
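To load the settings from /etc/sysctl.conf without a reboot, you can have sysctl re-read the file:
# sysctl -p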
