Wednesday, June 19, 2013

Hadoop - ZooKeeper error: "Command aborted because of exception: Command timed-out after 150 seconds"

I recently restarted our Hadoop cluster. After stopping all the processes, I had trouble restarting ZooKeeper. The error message shown at startup was "Command aborted because of exception: Command timed-out after 150 seconds".

In the ZooKeeper log file, I saw this:

2013-06-18 16:58:15,261 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x13efb97bacd0006, likely client has closed socket
 at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
 at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:662)

The information in the log file didn't help at all. After poking around on the servers, I found that even though the flume service showed "Stopped" in the CM UI, flume processes were still running on each of the flume servers. I had to manually kill the flume process on each server, and then the cluster started without any problem.
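
For anyone hitting the same thing, here is a rough Python sketch of how the leftover processes could be spotted on a server. It is not what I ran at the time; it assumes the flume process has "flume" somewhere in its command line, and the function names are made up for illustration. It only lists PIDs, so the actual kill is still a manual step.

```python
import subprocess


def find_flume_pids(ps_output, keyword="flume"):
    """Return PIDs from `ps -eo pid,args` style output whose command line
    contains `keyword` (assumption: flume shows up by name in the args)."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)  # -> [pid, full command line]
        # The header row ("PID COMMAND") is skipped by the isdigit() check.
        if len(parts) == 2 and parts[0].isdigit() and keyword in parts[1].lower():
            pids.append(int(parts[0]))
    return pids


def list_local_flume_pids():
    """Run ps on the local machine and return PIDs that look like flume."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    return find_flume_pids(result.stdout)
```

Calling list_local_flume_pids() on each flume server should return an empty list once the service is really stopped. If the script file itself has "flume" in its name, it will match its own process, so name it something else.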
