Friday, October 04, 2013

Cloudera Manager - Where do you store your configuration parameters?


If you use Cloudera Manager to install Hadoop, you might be surprised that nothing takes effect after you modify /etc/hadoop/conf and restart the HDFS service. This is because service instances started by Cloudera Manager do not read configurations from the default locations; that is how Cloudera manages Hadoop configurations. Take HDFS as an example: when not managed by Cloudera Manager, there would usually be one HDFS configuration per host, located at /etc/hadoop/conf/hdfs-site.xml. Server-side daemons and clients running on the same host would all use that same configuration.

Cloudera Manager uses a database to store configuration and monitoring information. In CM, when you update a configuration, you are actually updating the "model" state. Cloudera Manager models the Hadoop stack: its roles, configurations, and inter-dependencies. Model state captures what is supposed to run where, and with what configurations.

For example, if you want to look up one of the configuration parameters, say "dfs_name_dir_list", open a terminal and connect to the Cloudera Manager database (I use postgres):

# If you don't know the password, check "/etc/cloudera-scm-server/db.properties"

# psql -h localhost -U scm
scm=> select * from configs where attr = 'dfs_name_dir_list';

It will show you all the roles that use this parameter.
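
If you want to narrow that output down, you can select just the columns you care about. This is a minimal sketch, assuming the configs table also exposes value, role_id and service_id columns (column names can differ between Cloudera Manager versions, so check the table definition first):

scm=> \d configs
scm=> select attr, value, service_id, role_id from configs where attr = 'dfs_name_dir_list';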

The file /etc/hadoop/conf/hdfs-site.xml contains only configuration relevant to an HDFS client. By default, if you run a program that needs to communicate with Hadoop, it will get the addresses of the NameNode and JobTracker, and other important configurations, from that directory.
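A quick way to verify what a client actually resolves from that directory is the hdfs getconf command (assuming the hdfs binary is on your PATH). The keys below are just examples; any property defined under /etc/hadoop/conf is picked up the same way, and you can point HADOOP_CONF_DIR at a different directory to make the client read another configuration set:

# hdfs getconf -confKey fs.defaultFS
# hdfs getconf -namenodes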

Hadoop services/daemons get their configurations from the process directory:
The HDFS server-side daemons (for example, NameNode and DataNode) obtain their configurations from a private per-process directory under /var/run/cloudera-scm-agent/process/unique-process-name. If you go into /var/run/cloudera-scm-agent/process, you will see many subdirectories with names like "xxx-hdfs-NAMENODE". Each time you update a service's configuration and restart it, a new directory with a new ID is created for that process. Giving each process its own private execution and configuration environment allows Cloudera Manager to control each process independently, which is crucial for some of the more esoteric configuration scenarios that show up.

# tree -a 4473-hdfs-NAMENODE
4473-hdfs-NAMENODE
├── cloudera_manager_agent_fencer.py
├── cloudera_manager_agent_fencer_secret_key.txt
├── cloudera-monitor.properties
├── core-site.xml
├── dfs_hosts_allow.txt
├── dfs_hosts_exclude.txt
├── event-filter-rules.json
├── hadoop-metrics2.properties
├── hdfs.keytab
├── hdfs-site.xml
├── http-auth-signature-secret
├── log4j.properties
├── logs
│   ├── stderr.log
│   └── stdout.log
├── topology.map
└── topology.py

1 directory, 16 files
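
So if you want to see the configuration the running NameNode is actually using (rather than what sits in /etc/hadoop/conf), look inside the newest NAMENODE process directory. A rough sketch, assuming the agent's default process path and a single NameNode role on the host (on older CDH releases the property may be called dfs.name.dir instead of dfs.namenode.name.dir):

# cd /var/run/cloudera-scm-agent/process
# ls -dt *-hdfs-NAMENODE | head -1
# grep -A1 'dfs.namenode.name.dir' $(ls -dt *-hdfs-NAMENODE | head -1)/hdfs-site.xml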

If you are interested in the full story and how CM works in detail, read the "Cloudera Manager Primer".
