Thursday, March 06, 2014

Apache Hadoop - How to Configure Namenode HA (High Availability)

To configure HA Namenodes, we need to modify the hdfs-site.xml configuration file.

ENV:
OS: CentOS 6.5
hadoop-2.2.0

Architecture:
hadoop1.com - namenode
hadoop2.com - secondary namenode
hadoop3.com - datanode

First of all, two important parameters you should consider carefully:
dfs.nameservices - the logical name for this new nameservice. Choose a logical name, for example "testcluster", and use it as the value of this config option. The name you choose is arbitrary; it will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list. Below is an example for the single-nameservice case:
<property>
    <name>dfs.nameservices</name>
    <value>testcluster</value>
    <description>Comma-separated list of logical nameservices.</description>
</property>
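
For instance, if you were also running a second, hypothetical nameservice named "testcluster2" under Federation, the value would become a comma-separated list:
<property>
    <name>dfs.nameservices</name>
    <value>testcluster,testcluster2</value>
    <!-- "testcluster2" is a hypothetical second nameservice, for illustration only -->
</property>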


dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice. This property is configured with a list of comma-separated NameNode IDs. This will be used by datanodes to determine all the namenodes in the cluster. For example, if you used "testcluster" as the nameservice ID previously, and you wanted to use "nn1" and "nn2" as the individual IDs of the namenodes, you would configure this property as:
<property>
    <name>dfs.ha.namenodes.testcluster</name>
    <value>nn1,nn2</value>
    <description>Unique identifiers for each NameNode in the nameservice.</description>
</property>


Other necessary configuration parameters:
dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each namenode to listen on. For both of the previously-configured namenode IDs, set the full address and IPC port of the namenode process. Note that this results in two separate configuration options.
<property>
    <name>dfs.namenode.rpc-address.testcluster.nn1</name>
    <value>hadoop1.com:8020</value>
    <description>The fully-qualified RPC address for each namenode to listen on</description>
</property>

<property>
    <name>dfs.namenode.rpc-address.testcluster.nn2</name>
    <value>hadoop2.com:8020</value>
    <description>The fully-qualified RPC address for each namenode to listen on</description>
</property>
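You will similarly need to set the HTTP address of each namenode's web UI via dfs.namenode.http-address.[nameservice ID].[name node ID]. For this cluster, using the 2.2.0 default port 50070 (which also matches the bootstrapStandby log later in this post):
<property>
    <name>dfs.namenode.http-address.testcluster.nn1</name>
    <value>hadoop1.com:50070</value>
</property>

<property>
    <name>dfs.namenode.http-address.testcluster.nn2</name>
    <value>hadoop2.com:50070</value>
</property>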

dfs.namenode.shared.edits.dir - the URI which identifies the group of JournalNodes where the namenodes will write/read edits. This is where one configures the addresses of the journalnodes which provide the shared edits storage, written to by the Active namenode and read by the Standby namenode to stay up-to-date with all the file system changes the active namenode makes. Though you must specify several journalnode addresses, you should only configure one of these URIs. The URI should be of the form: "qjournal://host1:port1;host2:port2;host3:port3/journalId". The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. For example, if the JournalNodes for this test cluster were running on the machines "hadoop1.com", "hadoop2.com", "hadoop3.com" and the nameservice ID were "testcluster", you would use the following as the value for this setting:
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1.com:8485;hadoop2.com:8485;hadoop3.com:8485/testcluster</value>
    <description>The URI which identifies the group of JournalNodes where the namenodes will write/read edits.</description>
</property>

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active namenode. Configure the name of the Java class which will be used by the DFS Client to determine which namenode is the current Active, and therefore which namenode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one:
<property>
    <name>dfs.client.failover.proxy.provider.testcluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    <description>The Java class that HDFS clients use to contact the Active namenode.</description>
</property>

dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active namenode during a failover.
Note: It is desirable for correctness of the system that only one namenode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one namenode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active namenode could serve stale read requests to clients, which may be out of date until that namenode shuts down when trying to write to the journalnodes. For this reason, it is still desirable to configure some fencing methods even when using the QJM. However, to improve the availability of the system in the event that the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example "shell(/bin/true)".

The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
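
The sshfence option SSHes to the target node to kill the namenode process, so it must also be told which SSH private key to use via dfs.ha.fencing.ssh.private-key-files. The key path below is an assumption; point it at a key that can log in to both namenode hosts as the hdfs user:
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hdfs/.ssh/id_rsa</value>
    <!-- assumed key location; must be able to SSH to hadoop1.com and hadoop2.com -->
</property>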


fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given.
Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used "testcluster" as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://testcluster</value>
</property>
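
With this set, clients address the logical nameservice instead of a single namenode host, so paths keep working after a failover. A quick sanity check once the cluster is up:
$ hdfs dfs -ls hdfs://testcluster/
$ hdfs dfs -ls /    # equivalent; resolved against fs.defaultFS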

dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state. This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array.

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/jn/</value>
</property>
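
On each JournalNode machine, this directory must exist and be writable by the user the daemons run as before they start. Assuming the daemons run as the hdfs user (the group name is an assumption; adjust to your environment):
# mkdir -p /data/jn
# chown -R hdfs:hdfs /data/jn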

After all necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run.
To start JournalNode daemons (hadoop1/2/3):
# su hdfs
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start journalnode
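
To confirm the daemon came up on each node, a quick sanity check (jps ships with the JDK; 8485 is the JournalNode port from dfs.namenode.shared.edits.dir):
$ jps | grep JournalNode
$ netstat -tln | grep 8485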

Once the JournalNodes have been started on all quorum servers, you must initially synchronize the two HA namenodes' on-disk metadata.

  • If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of the NameNodes (see the sketch after this list).
  • If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
  • If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
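
For the fresh-cluster case, a minimal sketch (run as the hdfs user, with the JournalNodes already running; the daemon command follows the same pattern as the journalnode one above):
On hadoop1 (nn1):
$ hdfs namenode -format
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

On hadoop2 (nn2), with nn1 running so its metadata can be fetched:
$ hdfs namenode -bootstrapStandby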

Stop HDFS services:
First of all, stop the namenode, secondary namenode, and datanode processes on hadoop1/2/3 (the stop/start scripts below are custom wrappers used in this environment).
On hadoop1
# su hdfs
$ ./stop_namenode.sh

On hadoop2
# su hdfs
$ ./stop_namenode2.sh

On hadoop3
# su hdfs
$ ./stop_datanode.sh

Start HDFS services:
On hadoop1:
Start namenode service:
# su hdfs
$ ./start_namenode.sh

On hadoop2, copy over the contents of NameNode metadata:
# su hdfs
$ /opt/hadoop-2.2.0/bin/hdfs namenode -bootstrapStandby
14/03/06 12:09:12 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop2.com/192.168.143.132
STARTUP_MSG:   args = [-bootstrapStandby]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = /opt/hadoop-2.2.0/etc/hadoop:/opt/hadoop-2.2.0/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:.../contrib/capacity-scheduler/*.jar
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common -r 1529768; compiled by 'hortonmu' on 2013-10-07T06:28Z
STARTUP_MSG:   java = 1.7.0_51
************************************************************/
14/03/06 12:09:12 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/03/06 12:09:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/06 12:09:13 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:13 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:13 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:13 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
=====================================================

About to bootstrap Standby ID nn2 from:
           Nameservice ID: testcluster
        Other Namenode ID: nn1
  Other NN's HTTP address: hadoop1.com:50070
  Other NN's IPC  address: hadoop1.com/192.168.143.131:8020
             Namespace ID: 655756118
            Block pool ID: BP-1154565833-192.168.143.131-1392849632450
               Cluster ID: CID-20ca12e6-c817-4ff1-bade-f725954fd6e2
           Layout version: -47
=====================================================

14/03/06 12:09:14 INFO common.Storage: Storage directory /data/nn1 has been successfully formatted.
14/03/06 12:09:14 INFO common.Storage: Storage directory /data/nn2 has been successfully formatted.
14/03/06 12:09:14 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:14 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:14 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:14 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:09:15 INFO namenode.TransferFsImage: Opening connection to http://hadoop1.com:50070/getimage?getimage=1&txid=2539&storageInfo=-47:655756118:0:CID-20ca12e6-c817-4ff1-bade-f725954fd6e2
14/03/06 12:09:15 INFO namenode.TransferFsImage: Transfer took 0.29s at 55.36 KB/s
14/03/06 12:09:15 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000002539 size 16855 bytes.
14/03/06 12:09:15 INFO util.ExitUtil: Exiting with status 0
14/03/06 12:09:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop2.com/192.168.143.132
************************************************************/

Then we start the namenode service on hadoop2:
$ ./start_namenode.sh

Check the log files on both hadoop1 and hadoop2 and make sure there are no errors.

Note: If you see an error like this:
 Journal Storage Directory /data/jn/testcluster is not formatted
it means you haven't initialized the shared edits directory with the edits data from the local NameNode edits directories yet.

To initialize the shared edits directory (you only need to run this command once; I ran it on hadoop1):
$ hdfs namenode -initializeSharedEdits
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop1.com/192.168.143.131
STARTUP_MSG:   args = [-initializeSharedEdits]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = /opt/hadoop-2.2.0/etc/hadoop:/opt/hadoop-2.2.0/share/hadoop/common/lib/jasper-runtime-5.5.23.jar.../hadoop-mapreduce-examples-2.2.0.jar:/contrib/capacity-scheduler/*.jar
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common -r 1529768; compiled by 'hortonmu' on 2013-10-07T06:28Z
STARTUP_MSG:   java = 1.7.0_51
************************************************************/
14/03/06 12:17:33 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/03/06 12:17:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/06 12:17:35 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 WARN common.Util: Path /data/nn1 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 WARN common.Util: Path /data/nn2 should be specified as a URI in configuration files. Please update hdfs configuration.
14/03/06 12:17:35 INFO namenode.HostFileManager: read includes:
HostSet(
)
14/03/06 12:17:35 INFO namenode.HostFileManager: read excludes:
HostSet(
)
14/03/06 12:17:35 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
14/03/06 12:17:35 INFO util.GSet: Computing capacity for map BlocksMap
14/03/06 12:17:35 INFO util.GSet: VM type       = 64-bit
14/03/06 12:17:35 INFO util.GSet: 2.0% max memory = 966.7 MB
14/03/06 12:17:35 INFO util.GSet: capacity      = 2^21 = 2097152 entries
14/03/06 12:17:35 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
14/03/06 12:17:35 INFO blockmanagement.BlockManager: defaultReplication         = 1
14/03/06 12:17:35 INFO blockmanagement.BlockManager: maxReplication             = 512
14/03/06 12:17:35 INFO blockmanagement.BlockManager: minReplication             = 1
14/03/06 12:17:35 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
14/03/06 12:17:35 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
14/03/06 12:17:35 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
14/03/06 12:17:35 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
14/03/06 12:17:35 INFO namenode.FSNamesystem: fsOwner             = hdfs (auth:SIMPLE)
14/03/06 12:17:35 INFO namenode.FSNamesystem: supergroup          = supergroup
14/03/06 12:17:35 INFO namenode.FSNamesystem: isPermissionEnabled = true
14/03/06 12:17:35 INFO namenode.FSNamesystem: Determined nameservice ID: testcluster
14/03/06 12:17:35 INFO namenode.FSNamesystem: HA Enabled: true
14/03/06 12:17:35 INFO namenode.FSNamesystem: Append Enabled: true
14/03/06 12:17:36 INFO util.GSet: Computing capacity for map INodeMap
14/03/06 12:17:36 INFO util.GSet: VM type       = 64-bit
14/03/06 12:17:36 INFO util.GSet: 1.0% max memory = 966.7 MB
14/03/06 12:17:36 INFO util.GSet: capacity      = 2^20 = 1048576 entries
14/03/06 12:17:36 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/03/06 12:17:36 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
14/03/06 12:17:36 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
14/03/06 12:17:36 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
14/03/06 12:17:36 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
14/03/06 12:17:36 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
14/03/06 12:17:36 INFO util.GSet: Computing capacity for map Namenode Retry Cache
14/03/06 12:17:36 INFO util.GSet: VM type       = 64-bit
14/03/06 12:17:36 INFO util.GSet: 0.029999999329447746% max memory = 966.7 MB
14/03/06 12:17:36 INFO util.GSet: capacity      = 2^15 = 32768 entries
14/03/06 12:17:36 INFO common.Storage: Lock on /data/nn1/in_use.lock acquired by nodename 16755@hadoop1.com
14/03/06 12:17:36 INFO common.Storage: Lock on /data/nn2/in_use.lock acquired by nodename 16755@hadoop1.com
14/03/06 12:17:36 INFO namenode.FSImage: No edit log streams selected.
14/03/06 12:17:36 INFO namenode.FSImage: Loading image file /data/nn1/current/fsimage_0000000000000002539 using no compression
14/03/06 12:17:36 INFO namenode.FSImage: Number of files = 151
14/03/06 12:17:36 INFO namenode.FSImage: Number of files under construction = 0
14/03/06 12:17:36 INFO namenode.FSImage: Image file /data/nn1/current/fsimage_0000000000000002539 of size 16855 bytes loaded in 0 seconds.
14/03/06 12:17:36 INFO namenode.FSImage: Loaded image for txid 2539 from /data/nn1/current/fsimage_0000000000000002539
14/03/06 12:17:36 INFO namenode.NameCache: initialized with 3 entries 48 lookups
14/03/06 12:17:36 INFO namenode.FSNamesystem: Finished loading FSImage in 430 msecs
14/03/06 12:17:38 INFO namenode.FileJournalManager: Recovering unfinalized segments in /data/nn1/current
14/03/06 12:17:38 INFO namenode.FileJournalManager: Finalizing edits file /data/nn1/current/edits_inprogress_0000000000000002540 -> /data/nn1/current/edits_0000000000000002540-0000000000000002540
14/03/06 12:17:38 INFO namenode.FileJournalManager: Recovering unfinalized segments in /data/nn2/current
14/03/06 12:17:38 INFO namenode.FileJournalManager: Finalizing edits file /data/nn2/current/edits_inprogress_0000000000000002540 -> /data/nn2/current/edits_0000000000000002540-0000000000000002540
14/03/06 12:17:38 INFO client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
14/03/06 12:17:38 INFO client.QuorumJournalManager: Successfully started new epoch 1
14/03/06 12:17:38 INFO namenode.EditLogInputStream: Fast-forwarding stream '/data/nn1/current/edits_0000000000000002540-0000000000000002540' to transaction ID 2540
14/03/06 12:17:38 INFO namenode.FSEditLog: Starting log segment at 2540
14/03/06 12:17:39 INFO namenode.FSEditLog: Ending log segment 2540
14/03/06 12:17:39 INFO namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 32
14/03/06 12:17:39 INFO util.ExitUtil: Exiting with status 0
14/03/06 12:17:39 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop1.com/192.168.143.131
************************************************************/

You might need to restart your HDFS service.

Start datanode service on hadoop3:
# su hdfs
$ ./start_datanode.sh

Check all the log files and make sure there are no error messages. At this point, you have successfully configured your Hadoop cluster into an HA Hadoop cluster.

Verification:
You can verify from WebUI or command line:
From WebUI:
Open a web browser on hadoop1:
http://hadoop1:50070
The namenode overview page should report this node's state (here, "active").

Open a web browser on hadoop2:
http://hadoop2:50070
The namenode overview page should report this node's state (here, "standby").

From command line:
[hdfs@hadoop1 ~]$ hdfs haadmin -getServiceState nn1
active
[hdfs@hadoop1 ~]$ hdfs haadmin -getServiceState nn2
standby
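
Beyond the namenode states, you can confirm the datanode on hadoop3 has registered with the active namenode:
$ hdfs dfsadmin -report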

If both namenodes show up as "standby", you probably haven't transitioned one of them to the "active" state yet. To promote one of the standby namenodes to active:
$ hdfs haadmin -transitionToActive nn1
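
Once both namenodes are healthy, you can also swap their roles with a coordinated failover, which fences the old active using the methods configured above. For example, to make nn2 the active namenode:
$ hdfs haadmin -failover nn1 nn2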


1 comment:

venkatsam said...

Hi, I am facing an error while executing hadoop commands after setting up HA.
If I submit

hdfs dfs -ls /

it throws UnknownHostException: testcluster.

Thanks
Venkat
