Thursday, January 02, 2014

Hadoop - datanode can't connect to namenode

This post shows you how to debug the Hadoop issue "datanode can't connect to namenode".

First of all, you need to understand how the datanode communicates with the namenode. Heartbeats are the mechanism by which the NameNode determines which DataNodes are currently active. Inside the NameNode, a HeartbeatMonitor thread manages this. The NameNode maintains information about the DataNodes in DataNodeDescriptor.java. DataNodeDescriptor tracks statistics on a given DataNode, such as available storage capacity, last update time, etc., and maintains a set of blocks stored on the DataNode. This data structure is internal to the NameNode.

By default each datanode heartbeats into the namenode every 3 seconds. The default is defined in hdfs-default.xml and can be overridden in hdfs-site.xml. If a DataNode fails to send a heartbeat for a long time (e.g., 10 minutes), then the HeartbeatMonitor will decide the DataNode is dead, and consider its replicas to be no longer available.
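For reference, the heartbeat interval is controlled by dfs.heartbeat.interval (in seconds); to change it you would add an override like this to hdfs-site.xml (3 shown here is just the default):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>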

If the datanode process keeps telling you that it can't connect to the namenode, you will see this kind of error message in the datanode log:

2013-06-02 12:41:21,378 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: master/xxx.xxx.xxx.xxx:8020. Already tried 3
time(s).

You should check the following things:

1. Hardware issue: e.g., the cable connecting the datanode to the rack switch is broken. Log into the datanode and try to ping or ssh to the namenode, to make sure the namenode is reachable from the datanode.
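For example (assuming the namenode host is namenode_hostname and the RPC port is 8020, as in the log above):

# ping -c 3 namenode_hostname
# ssh namenode_hostname
# telnet namenode_hostname 8020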

2. The firewall on the namenode. If it has a firewall, make sure it doesn't block the traffic:
# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination        
1    fail2ban-BadBots  tcp  --  0.0.0.0/0            0.0.0.0/0           multiport dports 80,443
2    fail2ban-SSH  tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:22

or

# iptables -L

The easiest way to test is to save the current rules ("service iptables save"), then turn off the firewall ("service iptables stop") and see if the datanode can connect to the namenode now.
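On RHEL/CentOS that would look something like this (the saved rules come back when you start the service again):

# service iptables save
# service iptables stop
(re-test the datanode connection, then bring the firewall back with "service iptables start")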

3. Make sure the namenode process is not binding to the wrong IP. The namenode uses the value of fs.default.name (port 8020 by default) to decide which address and port it serves the HDFS RPC protocol on. Do a:

# netstat -atnp | grep 8020
tcp        0      0 0.0.0.0:8020                0.0.0.0:*                   LISTEN      28046/java

Make sure it binds to the wildcard address ("0.0.0.0") on 8020. If you see something like:

# netstat -atnp | grep 8020
tcp        0      0 127.0.0.1:8020                0.0.0.0:*                   LISTEN      28046/java

You will have a problem: this means the RPC port is listening only on the loopback address 127.0.0.1, so datanodes on other machines cannot reach it.
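This often happens when the namenode hostname (or the value of fs.default.name) resolves to 127.0.0.1 (see also item 5 below). Assuming the namenode host is namenode_hostname, core-site.xml should point at the real hostname, for example:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode_hostname:8020</value>
</property>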

4. The dfs.hosts and dfs.hosts.exclude lists could be denying datanode registration. Check the dfs.hosts.exclude property in "hdfs-site.xml" (or "hadoop-site.xml" on old releases), for example:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>

Make sure the datanode IP or hostname is not listed in that excludes file.
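If you change the excludes file, make the namenode re-read it (assuming a Hadoop 1.x-style CLI):

# hadoop dfsadmin -refreshNodes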

5. Check the /etc/hosts file on the namenode and make sure its hostname is not resolving to 127.0.0.1 (and that 127.0.0.1 is not resolving to the hostname).
# host -t A -v namenode_hostname
# cat /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 namenode_hostname


In the above example, "namenode_hostname" is listed on the "127.0.0.1 ..." line; this will cause the namenode to bind its RPC port to the loopback address and break datanode connections.
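A corrected /etc/hosts would map the hostname to the machine's real IP instead, for example (xxx.xxx.xxx.xxx being the namenode's actual address):

127.0.0.1         localhost localhost.localdomain localhost4 localhost4.localdomain4
xxx.xxx.xxx.xxx   namenode_hostname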
