Monday, September 08, 2014

Cloudera Manager - Error Sending Messages to Firehose, Not Enough Data to Test.

Our Hadoop cluster recently lost one of its namenodes to a hardware error. Even though we have namenode HA enabled, other important services such as ZooKeeper and JournalNode were also running on the lost server, so we still had some downtime. I eventually added a replacement server to the cluster and brought the whole cluster back online. After everything was up, two datanodes kept reporting the following error:
Error sending messages to firehose: mgmt1-SERVICEMONITOR-46ebf1bb9c51277b3bd7cc6398f28303
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 70, in _send
    self._port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 471, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused
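
Errno 111 means the TCP connection from the agent to the Service Monitor was refused. One way to check this independently of the agent is to open a plain TCP connection from the affected datanode. The following is only a minimal sketch in Python 3 (the agent itself runs on Python 2.6); the host name and port in the usage comment are placeholders, since the traceback above does not show them, so look up the real values in your Service Monitor configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (errno 111), timeouts and DNS failures.
        return False


# Example usage (placeholder host and port, not taken from the traceback):
#   print(can_connect("mgmt1.example.com", 9997))
```

If this returns False on the problem datanodes but True on a healthy one, the issue is likely on the network or listener side rather than in the agent. In my case the connection problem went away after the hard restart described below.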

All datanodes have the same hardware configuration and the same software packages.

In the datanode status page in CM, it says "The health of this role's host was concerning. The following health checks were concerning: agent status.". Like the picture below:

A soft restart of the CM agent didn't help:
# /etc/init.d/cloudera-scm-agent restart
Stopping cloudera-scm-agent:                               [  OK  ]
Starting cloudera-scm-agent:                               [  OK  ]

A soft restart only restarts the scm agent process, not the processes managed by CM.

To make the error go away, at least in my case, you need to do a hard restart, which also restarts the supervisord process:
# /etc/init.d/cloudera-scm-agent hard_restart
Stopping cloudera-scm-agent:                               [  OK  ]
Stopping supervisord:                                      [  OK  ]
Starting cloudera-scm-agent:                               [  OK  ]

Cloudera Manager uses an open source process supervisor called supervisord, which takes care of redirecting log files, notifying of process failure, setting the effective user ID of the calling process to the right user, and so forth. "hard_restart" restarts the agent, the supervisord process, and all processes managed by supervisord. Of course the datanode will show as "Bad" for a short while, but it will become "OK" again at the next heartbeat.
