Thursday, January 09, 2014

Hadoop (CDH) Installation Prerequisites

This tutorial doesn't cover hardware preparation; I assume the servers are ready for installation. Although it uses Cloudera’s CDH distribution, it can also serve as a reference for installing other Hadoop distributions.

Operating system:
The first thing to decide is which OS to use for Hadoop. Although most of Hadoop is written in Java, Linux is the only production-quality option right now. Hortonworks does provide a Windows version of Hadoop, Hortonworks Data Platform 1.3 for Windows (for Windows Server 2008 R2 and 2012), and there is also win-hadoop, which requires Cygwin, but I still strongly recommend choosing Linux as your Hadoop cluster OS. Why? Because:

GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Although Windows has much better remote management support than most people realize, it's still tough to beat Linux when it comes to the ease (and price tag) of setting up a large compute farm. This is just a guess, but researchers who need to build such massive clusters are probably reluctant to put much of their budget toward OS licensing.

Your choice of OS may be influenced by your corporate platform, the administration tools you use, and your hardware support, but the best choice is the Linux distribution you’re most familiar with. It also depends on the Hadoop distribution you choose (Cloudera CDH, Hortonworks, MapR, or plain Apache Hadoop).

Since this tutorial is about installing Hadoop with Cloudera Manager/Standard, I will use what Cloudera recommends. Cloudera CDH4 provides packages for Red Hat-compatible, SLES, Ubuntu, and Debian systems.

Red Hat Enterprise Linux (RHEL):
5.7 (64-bit), 6.2 (32/64-bit), 6.4 (64-bit)

CentOS:
5.7 (64-bit), 6.2 (32/64-bit), 6.4 (64-bit)

Oracle Linux with Unbreakable Enterprise Kernel:
5.6 (64-bit), 6.4 (64-bit)

SUSE Linux Enterprise Server (SLES):
11 with Service Pack 1 or later (64-bit)

Ubuntu:
Lucid (10.04) - Long-Term Support (LTS) (64-bit)
Precise (12.04) - Long-Term Support (LTS) (64-bit)

Debian:
Squeeze (6.0.3) (64-bit)

For production environments, 64-bit packages are recommended. Except as noted above, CDH4 provides only 64-bit packages.
Cloudera has received reports that its RPMs work well on Fedora, but it has not tested this.

For demonstration purposes, I will use CentOS.

Software Preparation:
Hadoop requires a few external software packages:
  • Java Development Kit (JDK)
  • Network Time Protocol (NTP)
  • Secure Shell (ssh)
  • Mail Transfer Agent (MTA, such as sendmail or qmail)
  • Domain Name service (DNS)

Install JDK:
CDH4 is supported with the Oracle JDK. As of Cloudera Manager 4.7 and CDH 4.4, Oracle JDK 7 (JDK 1.7) is supported as well, with the following restrictions:
All CDH components must be running the same major version (that is, all deployed on JDK 6 or all deployed on JDK 7). For example, you cannot run Hadoop on JDK 6 while running Sqoop on JDK 7.
All nodes in the cluster must be running the same major JDK version: Cloudera does not support mixed environments (some nodes on JDK6 and others on JDK7).
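
One way to check these restrictions is to compare the major JDK version reported on each node. The sketch below is only an illustration and not part of Cloudera's tooling: jdk_major and current_jdk_version are hypothetical helper names, and the parsing assumes the "1.x" version strings that JDK 6 and JDK 7 print.

```shell
# Hypothetical helper: print the major JDK version for a version string.
# "1.7.0_02" -> 7, "1.6.0_45" -> 6 (old "1.x" numbering scheme).
jdk_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Hypothetical helper: print the version string of the java found on PATH.
# `java -version` writes to stderr, for example: java version "1.7.0_02"
current_jdk_version() {
    java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}
```

Running jdk_major "$(current_jdk_version)" on every node should print the same number everywhere.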

To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.
$ ll /usr/java/
total 4
lrwxrwxrwx 1 root root   16 May  1  2013 default -> /usr/java/latest
lrwxrwxrwx 1 root root   21 May  2  2013 latest -> /usr/lib/jvm/jdk1.7.0

Download the 64-bit Oracle JDK, then unpack and register it:
# tar -xzf jdk-7u2-linux-x64.tar.gz
# mkdir -p /usr/lib/jvm
# mv ./jdk1.7.0_02 /usr/lib/jvm/jdk1.7.0
# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1
# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1
# update-alternatives --config java
# update-alternatives --config javac
# update-alternatives --config javaws
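
The commands above register the JDK with update-alternatives, but they do not create the /usr/java/default link shown in the listing earlier. A minimal sketch for creating that link chain follows. link_jdk is a hypothetical helper, and JAVA_LINK_DIR is an assumed variable that exists only so the function can be tried against a scratch directory before touching /usr/java.

```shell
# Hypothetical helper: create the latest/default symlink chain.
# JAVA_LINK_DIR defaults to /usr/java (the Red Hat-style location).
link_jdk() {
    jdk_home="$1"
    link_dir="${JAVA_LINK_DIR:-/usr/java}"
    mkdir -p "$link_dir" || return 1
    # -n replaces an existing symlink instead of descending into its target
    ln -sfn "$jdk_home" "$link_dir/latest" || return 1
    ln -sfn "$link_dir/latest" "$link_dir/default"
}

# Example (as root):
#   link_jdk /usr/lib/jvm/jdk1.7.0
#   ls -l /usr/java/
```

On Ubuntu and Debian the equivalent target would be /usr/lib/jvm/default-java, as noted above.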

Install NTP:
# yum install ntp
# chkconfig ntpd on
# chkconfig --list | grep ntpd
# service ntpd start
# ps -ef | grep ntp

If you have your own time server, update the /etc/ntp.conf file to use your own server, otherwise you can keep the public servers.
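
If you do switch to your own time server, the edit can be scripted. This is a rough sketch under assumptions: set_ntp_server is a hypothetical helper, ntp1.example.com is a placeholder hostname, and the function assumes the stock file lists its servers on lines beginning with "server".

```shell
# Hypothetical helper: point an ntp.conf at a single time server.
# Existing "server" lines are commented out rather than deleted.
set_ntp_server() {
    conf="$1"; server="$2"
    tmp="$(mktemp)" || return 1
    sed 's/^server[[:space:]]/#&/' "$conf" > "$tmp" || return 1
    echo "server $server iburst" >> "$tmp"
    # Write back through the original file so its permissions are kept
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example (as root):
#   set_ntp_server /etc/ntp.conf ntp1.example.com
#   service ntpd restart
#   ntpq -p    # list the peers ntpd is now using
```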

Install SSH:
SSH should come with CentOS 6.x. Make sure your firewall is not blocking communication between the nodes of the Hadoop cluster, and prepare an administrator account for the installation.

Install Mail Transfer Agent:
It is typical to install “mailx”, a facility for sending and receiving mail on a Linux system. Sendmail is a full mail server that handles both incoming and outgoing mail, but it is complicated to configure. If you don't need to receive mail and only want to send it, use “mailx” (nail).
# yum install mailx
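Now send a test email to check that it works properly (user@example.com is a placeholder; use your own address):
# echo "Test Email" | mail -s "This is a test email." user@example.com
You can check whether anything is still waiting in the outgoing mail queue with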
$ mailq
Mail queue is empty

Install DNS:
For small Hadoop clusters, it doesn’t matter whether the servers find each other through a DNS server or through /etc/hosts files, but for large clusters it is better to run your own DNS server, which saves you from updating the hosts file on every node. The most important requirement Hadoop places on DNS is that forward and reverse lookups match exactly. To set up a DNS server, follow the guide:

To test:
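
A minimal sketch of such a test, assuming getent is available (it is part of glibc) and that each node is known by its fully qualified name. check_dns is a hypothetical helper, and the hostnames in the example are placeholders. Because getent goes through the system resolver, it works whether the names come from DNS or from /etc/hosts.

```shell
# Hypothetical helper: succeed only if NAME resolves to an address and that
# address resolves back to the same NAME (the first name getent reports).
check_dns() {
    name="$1"
    ip="$(getent hosts "$name" | awk '{print $1; exit}')"
    if [ -z "$ip" ]; then
        echo "$name: no forward record" >&2
        return 1
    fi
    back="$(getent hosts "$ip" | awk '{print $2; exit}')"
    if [ "$back" != "$name" ]; then
        echo "$name -> $ip -> ${back:-nothing}: mismatch" >&2
        return 1
    fi
    echo "$name <-> $ip OK"
}

# Example:
#   for h in node1.example.com node2.example.com; do check_dns "$h"; done
```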

Kernel Tuning:

Change limits:
Cloudera recommends increasing the number of file handles to more than 10,000.
# vi /etc/security/limits.conf
hdfs      -    nofile    1048576
hdfs      -    nproc     32000
mapred    -    nofile    1048576
mapred    -    nproc     32000
hbase     -    nofile    1048576
hbase     -    nproc     32000
hive      -    nofile    1048576
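
As a quick sanity check on the file, the sketch below lists any nofile entries that do not exceed the 10,000 figure quoted above. low_nofile_entries is a hypothetical helper, and it only handles numeric limits in the four-column layout shown here.

```shell
# Hypothetical helper: print "user limit" for every nofile entry in a
# limits.conf-style file whose value is not above MIN (default 10000).
low_nofile_entries() {
    awk -v min="${2:-10000}" '
        $1 !~ /^#/ && $3 == "nofile" && $4 + 0 <= min { print $1, $4 }
    ' "$1"
}

# Example:
#   low_nofile_entries /etc/security/limits.conf
# To see the limit a user actually gets after logging in again:
#   su - hdfs -s /bin/bash -c 'ulimit -n'
```

No output from low_nofile_entries means every listed nofile limit is above the threshold.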

Set swappiness:
Swapping Hadoop daemon data to disk can cause operations to time out and potentially fail if the disk is busy with other I/O. This is especially dangerous for HBase, because Region Servers must maintain communication with ZooKeeper or be marked as failed. To avoid this, set vm.swappiness to 0 (zero), which instructs the kernel not to swap application data as long as it has an alternative. Most Linux distributions ship with vm.swappiness set to 60 or even as high as 80.

# echo 0 > /proc/sys/vm/swappiness
# cat /proc/sys/vm/swappiness

Set overcommit_memory:
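The vm.overcommit_memory setting controls how the kernel responds to memory allocation requests: 0 (the default) applies a heuristic, 1 always grants the request, and 2 refuses allocations beyond swap plus the percentage of RAM given by vm.overcommit_ratio. So why does this matter to Hadoop? Hadoop Streaming (a library that allows MapReduce jobs to be written in any language that can read from standard input and write to standard output) works by forking the user’s code as a child process and piping data through it. This means we need to account not only for the memory the Java child task uses, but also for the moment after it forks and before it execs, when it appears to use twice the memory we’d expect. For this reason, it is sometimes necessary to set vm.overcommit_memory to 1 (one). Note that vm.overcommit_ratio only takes effect when vm.overcommit_memory is 2, so adjust it only if you choose that mode instead.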

# echo 1 >  /proc/sys/vm/overcommit_memory
# cat /proc/sys/vm/overcommit_memory

To make the changes permanent:
# vi /etc/sysctl.conf
Add the following (dashes not included):
# Hadoop Kernel tuning
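
The two values written to /proc earlier are lost on reboot. A minimal sketch for persisting them follows, assuming the entries wanted under the comment line are vm.swappiness = 0 and vm.overcommit_memory = 1 as discussed above. set_sysctl is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: set "KEY = VALUE" in a sysctl.conf-style file,
# replacing an existing uncommented entry for KEY or appending a new one.
set_sysctl() {
    conf="$1"; key="$2"; value="$3"
    tmp="$(mktemp)" || return 1
    # Escape the dots in the key so they match literally in the pattern
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    # Keep every line except an existing entry for this key
    grep -v "^[[:space:]]*$pattern[[:space:]]*=" "$conf" > "$tmp"
    echo "$key = $value" >> "$tmp"
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example (as root), using the values from this section:
#   set_sysctl /etc/sysctl.conf vm.swappiness 0
#   set_sysctl /etc/sysctl.conf vm.overcommit_memory 1
#   sysctl -p    # reload /etc/sysctl.conf without rebooting
```

Calling the function twice with the same key leaves a single entry, so it is safe to rerun.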

Once all the prerequisite software is installed, you can start the CDH installation. Follow the instructions here to install CDH:
How to install CDH using Cloudera Standard
