Since this is a long tutorial, it is split into several parts:
Apache Hadoop - How to Install a Three Nodes Cluster (2)
Apache Hadoop - How to Install a Three Nodes Cluster (3)
Apache Hadoop - How to Install a Three Nodes Cluster (4)
Hadoop version:
hadoop-2.2.0
VMware version:
VMware Workstation 10.0.1 build-1379776
Operating system:
The first thing you need to do is decide which OS you will use for Hadoop. Although most of Hadoop is written in Java, Linux is the only production-quality option right now. Hortonworks does provide a Windows version of Hadoop, Hortonworks Data Platform 1.3 for Windows on Windows Server 2008 R2 and 2012 (http://hortonworks.com/blog/getting-started-with-hadoop-on-windows-with-hdp-1-3/), and there is also win-hadoop (http://sourceforge.net/p/win-hadoop/wiki/Hadoop-on-Cygwin/), which requires Cygwin, but I still strongly recommend choosing Linux as your Hadoop cluster OS. Why?
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
Even though Windows has much better remote management support than most people realize, it's still tough to beat Linux when it comes to the ease (and price tag) of setting up a large compute farm. This is just a guess, but perhaps researchers who need to build such massive clusters are less inclined to put much of their budget toward OS licensing.
Your choice of OS may be influenced by your corporate platform, the administration tools you use, and your hardware support, but the best option is the Linux distribution you are most familiar with. It also depends on the Hadoop distribution (Cloudera CDH, Hortonworks, MapR, or even plain Apache Hadoop) you choose. I use CentOS 6.5 (64-bit).
Preparation:
In VMware, create three CentOS virtual machines. I use “CentOS-6.5-x86_64-LiveCD.iso”. Set hostnames to "hadoop1.com", "hadoop2.com" and "hadoop3.com".
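If you prefer to set the hostnames from the shell, here is a minimal sketch for CentOS 6; the IP addresses below are only placeholders, so substitute your own:
# hostname hadoop1.com
# vi /etc/sysconfig/network
set HOSTNAME=hadoop1.com
Repeat with the matching name on hadoop2.com and hadoop3.com. If you are not using DNS (covered later), also let every node resolve the others via /etc/hosts, for example:
192.168.1.101 hadoop1.com
192.168.1.102 hadoop2.com
192.168.1.103 hadoop3.com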
Update your OS:
Get the latest versions of packages from the repositories by running “yum update”.
Disable SELinux:
To avoid unnecessary installation problems, let's disable SELinux first; it is on by default:
# getenforce
Enforcing
# setenforce 0
# getenforce
Permissive
# vi /etc/selinux/config
set SELINUX=disabled
Create Hadoop users and groups:
Hadoop daemons should not run as the root user, and the HDFS and YARN daemons should run as different Unix users. We are going to create three users, hdfs, yarn and mapred, and one group called hadoop.
On all three servers:
# useradd hdfs
# useradd yarn
# useradd mapred
# groupadd hadoop
# usermod -a -G hadoop hdfs
# usermod -a -G hadoop yarn
# usermod -a -G hadoop mapred
# id hdfs
uid=501(hdfs) gid=501(hdfs) groups=501(hdfs),504(hadoop)
# id yarn
uid=502(yarn) gid=502(yarn) groups=502(yarn),504(hadoop)
# id mapred
uid=503(mapred) gid=503(mapred) groups=503(mapred),504(hadoop)
Software requirements:
Hadoop requires a few external software packages:
- Java Development Kit (JDK)
- Network Time Protocol (NTP)
- Secure Shell (ssh)
- Mail Transfer Agent (MTA, such as sendmail or qmail)
- Domain Name service (DNS)
To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.
$ ll /usr/java/
total 4
lrwxrwxrwx 1 root root 16 May 1 2013 default -> /usr/java/latest
lrwxrwxrwx 1 root root 21 May 2 2013 latest -> /usr/lib/jvm/jdk1.7.0
Download JDK 64-bit from http://www.oracle.com/technetwork/java/javase/downloads/index.html
Then:
# tar -xzf jdk-7u51-linux-x64.tar.gz
# mkdir /usr/lib/jvm
# mv ./jdk1.7.0_51 /usr/lib/jvm/jdk1.7.0
# update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
# update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1
# update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0/bin/javaws" 1
# update-alternatives --config java
# update-alternatives --config javac
# update-alternatives --config javaws
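With the JDK in place, one way to create the /usr/java/default link shown earlier (a sketch, assuming the same /usr/lib/jvm/jdk1.7.0 path as above) and then verify the installation:
# mkdir -p /usr/java
# ln -s /usr/lib/jvm/jdk1.7.0 /usr/java/latest
# ln -s /usr/java/latest /usr/java/default
# java -version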
Install NTP:
# yum install ntp
# chkconfig ntpd on
# chkconfig --list | grep ntpd
# service ntpd start
# ps -ef | grep ntp
If you have your own time server, update the /etc/ntp.conf file to use it; otherwise you can keep the public servers:
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org
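Once ntpd has been running for a few minutes, you can verify that each node is actually synchronizing against the configured servers:
# ntpq -p
# date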
Install SSH:
SSH comes with CentOS 6.x. Make sure your firewall is not blocking communication between the Hadoop cluster nodes, and prepare an administrator account for the installation.
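A quick way to confirm sshd is running and, on a test cluster, to take the firewall out of the picture (on a production cluster you would open the specific Hadoop ports instead of disabling iptables entirely):
# service sshd status
# chkconfig sshd on
# service iptables stop
# chkconfig iptables off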
Install Mail Transfer Agent:
It is typical to install “mailx”, a sending and receiving facility for mail on a Linux system. Sendmail is a more professional mail server that can handle both incoming and outgoing mail requests; however, sendmail is complicated to configure. If you are not looking to receive mail and only want to send mail out, use “mailx (nail)”.
# yum install mailx
Now send a test email to check that it works properly:
# echo "Test Email" | mail -s "This is a test email." externalemail@domain.com
You can check whether anything is stuck in the mail queue with:
$ mailq
Mail queue is empty
Install DNS:
For small Hadoop clusters, it doesn't matter whether the servers find each other through a DNS server or the /etc/hosts file, but for large Hadoop clusters it is better to have your own DNS server; it saves you from updating the hosts file on every node. The most important DNS requirement for Hadoop is that forward and reverse lookups match exactly (a quick check is sketched after the links below). To set up a DNS server, follow these guides:
http://tonylixu.blogspot.ca/2013/09/how-to-set-up-two-in-one-dns-server-or.html
http://tonylixu.blogspot.ca/2013/09/how-to-set-up-two-in-one-dns-server-or_18.html
To test:
http://tonylixu.blogspot.ca/2013/12/hadoop-host-identification-process.html
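As a quick sanity check that forward and reverse lookups agree, run something like this from any node (the IP address is an example; the host command comes from the bind-utils package):
# host hadoop1.com
# host 192.168.1.101
The two answers should point back at each other.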
Kernel Tuning:
Change limits:
Cloudera recommends increasing the number of file handles to more than 10,000.
# vi /etc/security/limits.conf
add
---------------------------------
hdfs - nofile 1048576
hdfs - nproc 32000
mapred - nofile 1048576
mapred - nproc 32000
hbase - nofile 1048576
hbase - nproc 32000
hive - nofile 1048576
---------------------------------
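After saving limits.conf, confirm that the new limits take effect in a fresh login session for one of the Hadoop users, for example:
# su - hdfs -c 'ulimit -n -u'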
Set swappiness:
Swapping Hadoop daemon data to disk can cause operations to time out and potentially fail if the disk is performing other I/O operations. This is especially dangerous for HBase, as RegionServers must maintain communication with ZooKeeper lest they be marked as failed. To avoid this, set vm.swappiness to 0 (zero) to instruct the kernel never to swap application data if it has the option. Most Linux distributions ship with vm.swappiness set to 60 or even as high as 80.
# echo 0 > /proc/sys/vm/swappiness
# cat /proc/sys/vm/swappiness
Set overcommit_memory:
The vm.overcommit_memory setting controls how strictly the kernel checks whether enough memory is available before granting an allocation (or a fork, which momentarily duplicates the parent's address space). So why does this matter to Hadoop? Hadoop Streaming, a library that allows MapReduce jobs to be written in any language that can read from standard input and write to standard output, works by forking the user's code as a child process and piping data through it. This means that not only do we need to account for the memory the Java child task uses, but also that when it forks, for a moment in time before it execs, it uses twice the amount of memory we'd expect it to. For this reason, it is sometimes necessary to set vm.overcommit_memory to the value 1 (one) and adjust vm.overcommit_ratio accordingly.
# echo 1 > /proc/sys/vm/overcommit_memory
# cat /proc/sys/vm/overcommit_memory
# vi /etc/sysctl.conf
add the following (dashes not included)
---------------------------------
# Hadoop Kernel tuning
vm.swappiness=0
vm.overcommit_memory=1
---------------------------------
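To load the sysctl.conf changes without a reboot and confirm the values:
# sysctl -p
# sysctl vm.swappiness vm.overcommit_memory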