Thursday, December 12, 2013

Hadoop Namenode High Availability (QJM)

What is Namenode HA?
In a Hadoop cluster with HA enabled, two separate machines are configured as NameNodes. At any given time, one NameNode is in the active state and the other is in the standby state. The active NameNode is responsible for all client operations, while the standby NameNode maintains enough of its state to provide a fast failover when necessary.

There are two ways of implementing NameNode HA:
1. Quorum-based Storage
2. Shared storage using NFS

Quorum-based Storage refers to the HA implementation that uses the Quorum Journal Manager (QJM). With QJM, both NameNodes communicate with a group of JournalNodes to keep their state synchronized. When any namespace or metadata modification is performed by the active NameNode, it durably logs a record of the modification to a majority of the JournalNodes. The standby NameNode is capable of reading the edits from the JournalNodes, and it constantly watches them for changes to the edit log. When the standby NameNode sees a change, it applies it to its own namespace. During a failover, the standby NameNode ensures that it has read and applied all the changes before promoting itself to the active state. This way, the namespace state is guaranteed to be fully synchronized.
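For reference, a partial hdfs-site.xml sketch of the QJM-related settings might look like the following (the nameservice name nameservice1, the NameNode IDs nn1/nn2 and the JournalNode hostnames are placeholders; adapt them to your cluster):

<property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
</property>
<property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://journalnode1:8485;journalnode2:8485;journalnode3:8485/nameservice1</value>
</property>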

How fast is a fast failover?
In order to provide a fast failover, it is also necessary that the standby NameNode has up-to-date information regarding the location of blocks in the cluster. To achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both. The failover time is typically around 20~30 seconds, and a 20-second timeout is hardcoded in the IPC layer of Hadoop. If you use the ZooKeeper FailoverController (ZKFC) for automatic failover, the failure detection time is determined by ha.zookeeper.session-timeout.ms, which is 5 seconds by default.
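If you need a different trade-off between detection speed and false positives, the session timeout can be tuned in core-site.xml; the value below is just the default, shown for illustration:

<property>
    <name>ha.zookeeper.session-timeout.ms</name>
    <value>5000</value>
</property>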

What are the hardware requirements for QJM HA?
You need at least:

  • NameNode machines - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
  • JournalNode machines - the machines on which you run the JournalNodes.

Note: There must be at least three JournalNode daemons, since edit log modifications must be written to a majority of JournalNodes. This also allows the system to tolerate the failure of a single machine. Of course you can run more than three JNs, but you should keep an odd number of them. When running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally. It is recommended that you deploy the JournalNode daemons on the "master" host or hosts, i.e. the NameNode, Standby NameNode, JobTracker, etc., so that the JournalNodes' local directories can use reliable local storage.
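Each JournalNode keeps its copy of the edit log in a local directory configured in hdfs-site.xml; the path below is only a placeholder:

<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/1/dfs/jn</value>
</property>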

Install and Start the JournalNodes:
Install the JournalNode daemons on each of the machines where they will run.
# yum install hadoop-hdfs-journalnode
Start the JournalNode daemons on each of the machines where they will run:
# service hadoop-hdfs-journalnode start
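As a quick sanity check that each daemon came up (jps ships with the JDK; the status subcommand assumes the standard CDH init scripts):
# sudo -u hdfs jps | grep JournalNode
# service hadoop-hdfs-journalnode status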

Initialize the Shared Edits directory (Skip if you are not converting a non-HA NameNode to HA):
# hdfs namenode -initializeSharedEdits
Start the primary NameNode:
# service hadoop-hdfs-namenode start
Bootstrap and start the standby NameNode:
$ sudo -u hdfs hdfs namenode -bootstrapStandby
# service hadoop-hdfs-namenode start
Running hdfs namenode -bootstrapStandby copies the contents of the primary NameNode's metadata directories (including the namespace information and the most recent checkpoint) over to the standby NameNode before it is started.
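Note that until automatic failover is configured, both NameNodes may come up in standby state. As a sketch, you can manually promote one of them with the haadmin tool (nn1 is a placeholder NameNode ID from dfs.ha.namenodes.nameservice1):
$ sudo -u hdfs hdfs haadmin -transitionToActive nn1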

Restart JobTracker, TaskTracker and DataNode:
On all datanodes:
# service hadoop-hdfs-datanode start

On all TaskTracker nodes:
# service hadoop-0.20-mapreduce-tasktracker start

On JobTracker node:
# service hadoop-0.20-mapreduce-jobtracker start

Deploy Automatic Failover:
If you have configured automatic failover using the ZooKeeper FailoverController (ZKFC), you must install and start the zkfc daemon on each of the machines that runs a NameNode.
# yum install hadoop-hdfs-zkfc

Make sure you have configured the following parameters in hdfs-site.xml:
<property>
    <name>dfs.ha.automatic-failover.enabled.nameservice1</name>
    <value>true</value>
</property>

Setting this to true instructs the startup scripts to additionally start the failover controller process and manage NameNode state transitions using ZooKeeper for coordination.

<property>
    <name>ha.zookeeper.quorum</name>
    <value>zookeeper1:2181,zookeeper2:2181,zookeeper3:2181</value>
</property>

When using the HDFS automatic failover feature, ZooKeeper must be properly configured. This property specifies the nodes that make up the ZooKeeper quorum.
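If this is the first time automatic failover is being enabled on the cluster, the HA state in ZooKeeper usually needs to be initialized once; a sketch of that one-time step, run from one of the NameNode hosts:
$ sudo -u hdfs hdfs zkfc -formatZK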

Restart the HDFS services, then start the ZKFC daemon on each NameNode host:
# service hadoop-hdfs-zkfc start

Testing Automatic Failover:
Go to the active NameNode, locate the PID of the NameNode process, and issue:
# kill -9 pid_nn
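One way to locate the PID before issuing the kill (assuming the CDH packaging, where the NameNode runs as the hdfs user):
$ sudo -u hdfs jps | grep NameNode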

You should see the standby NameNode become active within a couple of seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of ha.zookeeper.session-timeout.ms, which defaults to 5 seconds.
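To confirm the new state of each NameNode, you can query them with the haadmin tool (again, nn1/nn2 are placeholder NameNode IDs):
$ sudo -u hdfs hdfs haadmin -getServiceState nn1
$ sudo -u hdfs hdfs haadmin -getServiceState nn2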

Alternatively, you can use Cloudera Manager to configure your CDH4 cluster for HDFS High Availability (HA). Just follow the steps on the following page:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Free/4.5.1/Cloudera-Manager-Free-Edition-User-Guide/cmfeug_topic_5_8.html
