Tuesday, February 04, 2014

Zookeeper Error - Parent /cloudera_manager_zookeeper_canary missing, Unable to load database on disk

When I tried to start one of my ZooKeeper servers, it refused to start, and the log file showed the following error messages:

ERROR org.apache.zookeeper.server.persistence.FileTxnSnapLog 
Parent /cloudera_manager_zookeeper_canary missing for /cloudera_manager_zookeeper_canary/zookeeper1-SERVER-d75e87ec7c989094688ff05ac1e2c2e0

org.apache.zookeeper.server.persistence.FileSnap 
Reading snapshot /data1/zookeeper/version-2/snapshot.3a001026c6

Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /cloudera_manager_zookeeper_canary
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:188)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:156)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /cloudera_manager_zookeeper_canary
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:250)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:186)
... 6 more

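The log file tells me that the server is not starting because it cannot load its database from disk. While restoring from "snapshot.3a001026c6" and replaying the transaction log, it hits a create transaction (type 1) for a znode whose parent, /cloudera_manager_zookeeper_canary, does not exist. The snapshot or transaction log files are most likely corrupted. To clear the corrupted data files and let the server regenerate them from the rest of the ensemble, we need to delete all the files in the version-2 directory under the datadir. But before you do that, make sure all the other servers in your ensemble are up and working. You can use the "stat" command to verify that: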

# echo "stat" | nc zookeeper1.tony.com 2181
Zookeeper version: 3.4.5-cdh4.5.0--1, built on 11/20/2013 22:29 GMT
Clients:
 /10.6.70.35:60982[1](queued=0,recved=141,sent=141)
 /10.6.70.2:52908[1](queued=0,recved=148,sent=151)
 /10.6.70.2:52988[1](queued=0,recved=147,sent=147)
 /10.6.70.33:44230[1](queued=0,recved=261,sent=269)
 /10.6.70.3:33691[1](queued=0,recved=272,sent=274)
 /10.6.70.30:43581[1](queued=0,recved=252,sent=260)
 /10.6.70.3:35740[0](queued=0,recved=1,sent=0)
 /10.6.70.3:33639[1](queued=0,recved=345,sent=348)
 /10.6.70.33:44252[1](queued=0,recved=150,sent=150)
 /10.6.70.32:34600[1](queued=0,recved=146,sent=146)
 /10.6.70.30:43695[1](queued=0,recved=141,sent=141)

Latency min/avg/max: 0/0/11
Received: 2603
Sent: 2777
Connections: 11
Outstanding: 0
Zxid: 0x3d00000468
Mode: leader
Node count: 48
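
To run the same check against every member of the ensemble, the stat command above can be wrapped in a small loop. This is only a sketch: the function names zk_mode and check_ensemble are made up for illustration, the host names in the usage comment are placeholders modeled on zookeeper1.tony.com, and the -w timeout flag assumes an nc variant that supports it.

```shell
# Extract the value of the "Mode:" line from "stat" output read on stdin.
zk_mode() {
    awk -F': ' '/^Mode:/ { print $2 }'
}

# Print the mode (leader, follower, standalone) of each host given as an
# argument, or "not responding" if no Mode line comes back.
# Assumes the default client port 2181, as in the command above.
check_ensemble() {
    for host in "$@"; do
        mode=$(echo "stat" | nc -w 2 "$host" 2181 2>/dev/null | zk_mode)
        echo "$host: ${mode:-not responding}"
    done
}

# Example call (hypothetical host names):
# check_ensemble zookeeper2.tony.com zookeeper3.tony.com
```

A healthy ensemble should show exactly one leader and the remaining servers as followers.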

After you have verified that all the other servers in the ensemble are up, you can go ahead and clean the database of the corrupt server: delete all the files in datadir/version-2 and datalogdir/version-2/, then restart the server.
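
If you would rather keep the old files around than delete them outright, the cleanup can be sketched as a move into a backup directory. This is a cautious variant of the step above, not what I actually ran: the function name backup_and_clear is made up, /data1/zookeeper is the directory from the log above, and the datalogdir path in the usage comment is a placeholder. Stop the ZooKeeper server before running it.

```shell
# Move every file in <dir>/version-2 into a timestamped backup directory
# next to it, leaving version-2 itself in place but empty.
# Prints the backup directory path on success.
backup_and_clear() {
    dir="$1"
    src="$dir/version-2"
    if [ ! -d "$src" ]; then
        echo "no version-2 directory under $dir" >&2
        return 1
    fi
    backup="$dir/version-2.bak.$(date +%Y%m%d%H%M%S)"
    mkdir "$backup" || return 1
    # Snapshot and transaction log files are regular files (snapshot.*, log.*).
    for f in "$src"/*; do
        if [ -f "$f" ]; then
            mv "$f" "$backup"/ || return 1
        fi
    done
    echo "$backup"
}

# Example calls (the second path is a placeholder for your datalogdir):
# backup_and_clear /data1/zookeeper
# backup_and_clear /path/to/datalogdir
```

If datadir and datalogdir are the same directory, call the function only once. The myid file sits directly in the datadir, not in version-2, so it is left untouched.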

1 comment:

Rick McBride said...

Thanks! This saved my evening.