Thursday, April 28, 2016

Puppet - Corrupted KahaDB for puppetdb

PuppetDB uses ActiveMQ for queuing commands, both those received via the API and sometimes those initiated internally. The queue uses a storage technology built for ActiveMQ called "KahaDB". "KahaDB" is a file-based persistence database that is designed for high-performance queuing.

"KahaDB" is located under "$vardir"/mq/localhost. You can find your "vardir" in "config.ini" (in my case, "/etc/puppetdb/conf.d/config.ini"). A snapshot of the "KahaDB" directory:
# tree -CDups /var/lib/puppetdb/mq/localhost/

/var/lib/puppetdb/mq/localhost/
├── [drwxr-xr-x puppetdb          60 Apr 28 11:28]  KahaDB
│   ├── [-rw-r--r-- puppetdb    33030144 Apr 28 12:02]  db-1.log
│   ├── [-rw-r--r-- puppetdb       32768 Apr 28 12:02]  db.data
│   ├── [-rw-r--r-- puppetdb       28720 Apr 28 12:02]  db.redo
│   └── [-rw-r--r-- puppetdb           0 Apr 28 11:28]  lock
├── [drwxr-xr-x puppetdb          52 Apr 27 16:34]  KahaDB.old
│   ├── [-rw-r--r-- puppetdb    33030144 Apr 27 14:45]  db-1664.log
│   ├── [-rw-r--r-- puppetdb       32768 Apr 27 14:45]  db.data
│   └── [-rw-r--r-- puppetdb       28720 Apr 27 14:45]  db.redo
└── [drwxr-xr-x puppetdb          76 Apr 28 11:28]  scheduler
    ├── [-rw-r--r-- puppetdb           0 Jun  3  2015]  db-1.log
    ├── [-rw-r--r-- puppetdb           0 Apr 28 11:28]  lock
    ├── [-rw-r--r-- puppetdb       20480 Apr 28 11:28]  scheduleDB.data
    └── [-rw-r--r-- puppetdb       16408 Jun  3  2015]  scheduleDB.redo
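
If you are not sure where your "vardir" points, a small helper can read it from the config file. This is only a sketch: it assumes the setting is written as a plain "vardir = /path" line, and puppetdb_vardir is a name made up for this post, not something PuppetDB ships.

```shell
# Print the vardir value from a PuppetDB config file.
# Assumes an INI-style line such as "vardir = /var/lib/puppetdb".
# puppetdb_vardir is a hypothetical helper name, not part of PuppetDB.
puppetdb_vardir() {
    sed -n 's/^[[:space:]]*vardir[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Example use (path taken from my setup, adjust to yours):
# ls -l "$(puppetdb_vardir /etc/puppetdb/conf.d/config.ini)/mq/localhost"
```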

In some cases, "KahaDB"’s storage might become corrupt or simply unreadable due to the version of PuppetDB that you’ve launched. There are a number of possible causes, including:
    - Disk running out of space
    - Bug in "KahaDB"
    - PuppetDB upgrade or downgrade
    - Other unknown causes

If you have a corrupted "KahaDB", you will see the following in puppetdb.log:
java.io.IOException: Unable to start broker in "/var/lib/puppetdb/mq". This is probably due to KahaDB corruption or version incompatibility after a PuppetDB downgrade (see "KahaDB Corruption" in the PuppetDB manual).
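
A quick way to check for that message is to grep the log. The sketch below assumes the log lives at /var/log/puppetdb/puppetdb.log, which may differ on your system, and kahadb_error_logged is a hypothetical helper name.

```shell
# Succeed if the given log file contains the broker start failure quoted above.
# kahadb_error_logged is a hypothetical helper name.
kahadb_error_logged() {
    grep -q 'Unable to start broker' "$1"
}

# Example use (log path is an assumption):
# kahadb_error_logged /var/log/puppetdb/puppetdb.log && echo "KahaDB may be corrupted"
```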

To recover, the simplest way is to move the "KahaDB" directory out of the way and restart PuppetDB:

# service puppetdb stop
# cd /var/lib/puppetdb/mq/localhost
# mv KahaDB KahaDB.old
# service puppetdb start

In most cases the above steps will solve the problem, though in the process you might lose any queued, unprocessed data (data that had not reached PostgreSQL yet). Re-running Puppet on your nodes should normally resubmit the lost commands.
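
If a "KahaDB.old" directory is already there from an earlier recovery, a plain mv would place the new directory inside the old one. A variant of the steps above that uses a timestamped name avoids this. It is a sketch only: move_kahadb_aside is a hypothetical helper, and the directory in the usage comment is the one from my listing.

```shell
# Move KahaDB aside under a timestamped name so an existing KahaDB.old is left alone.
# move_kahadb_aside is a hypothetical helper; run it only while PuppetDB is stopped.
move_kahadb_aside() {
    mq_dir=$1
    target="$mq_dir/KahaDB.$(date +%Y%m%d-%H%M%S)"
    if [ ! -d "$mq_dir/KahaDB" ]; then
        echo "no KahaDB directory in $mq_dir" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "$target already exists" >&2
        return 1
    fi
    # Print the new location so it can be found (or removed) later.
    mv "$mq_dir/KahaDB" "$target" && echo "$target"
}

# Example use:
# service puppetdb stop
# move_kahadb_aside /var/lib/puppetdb/mq/localhost
# service puppetdb start
```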

Other solutions:

    - Clear your db.data file and recreate it. The db.data file represents your index, and clearing it may force the file to be recreated from the logs.
    - Clear your db-*.log files, which contain the journal. While KahaDB is generally good at pinpointing corruption and ignoring it today (in fact much better since PuppetDB 1.1.0), there are still edge cases. Clearing the files may let you skip over these bad blocks. It might be that only one of these files is corrupted and the remainder are good, so you could attempt clearing one at a time (newest first) to find the culprit.
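
To work through the journal files newest first, it helps to list them by modification time. The sketch below does just that. journals_newest_first is a hypothetical helper, and moving a file to a separate directory instead of deleting it is my own cautious substitute for "clearing", so that the file can be put back.

```shell
# List KahaDB journal files in the given directory, newest first (by mtime).
# journals_newest_first is a hypothetical helper name.
journals_newest_first() {
    ls -t "$1"/db-*.log
}

# Example use, with PuppetDB stopped (the /root/kahadb-quarantine path is an assumption):
# journals_newest_first /var/lib/puppetdb/mq/localhost/KahaDB
# mkdir -p /root/kahadb-quarantine
# mv /var/lib/puppetdb/mq/localhost/KahaDB/db-1.log /root/kahadb-quarantine/
```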

If you are super paranoid, regularly back up your "KahaDB" directory.
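
A backup can be as simple as a timestamped tar archive, for example from cron. This is a minimal sketch: backup_kahadb and the /backup destination are hypothetical, and I am assuming that an archive taken while PuppetDB is still writing to the queue may not be fully consistent.

```shell
# Archive the KahaDB directory found in mq_dir into dest_dir as a timestamped tar.gz.
# backup_kahadb is a hypothetical helper name.
backup_kahadb() {
    mq_dir=$1
    dest_dir=$2
    archive="$dest_dir/kahadb-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths in the archive relative to mq_dir (KahaDB/db.data, ...).
    tar -czf "$archive" -C "$mq_dir" KahaDB && echo "$archive"
}

# Example use (destination directory is an assumption):
# backup_kahadb /var/lib/puppetdb/mq/localhost /backup
```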
