Sunday, February 24, 2019

ECE - Small Trick: Query Zookeeper Statistics

ECE stands for "Elastic Cloud Enterprise". It shares most of its codebase with Elastic Cloud. The key tenets of the architecture are:
  • Service-oriented architecture
  • Containerization using Docker
  • Deployment state coordination using ZooKeeper
  • Easy access through the Cloud UI

Because ECE is service-oriented, scaling the platform is easy: each service can be scaled separately, so different services can have different reliability and performance requirements. The flip side is that this hides some of the technical details from you, especially from administrators who are curious about how it works inside. Sometimes it can be really hard to get information out of a vendor container.

In our case, we have both ECE 1.x and 2.x installed in our environment, and one of the challenges we had was how to monitor the ZooKeeper status. The ZooKeeper status from the admin console doesn't count :). Fortunately, the ECE ZooKeeper container publishes its port on "0.0.0.0" (i.e., 0.0.0.0:2192->2192/tcp), which means you can query some whitelisted information through the host IP and the exposed port.
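
To see which ports a ZooKeeper container publishes on a given host, you can filter the "docker ps" output. This is a minimal sketch, assuming the container name contains "zookeeper" (the exact name may differ in your install); "published_ports" is a hypothetical helper name, not something ECE ships:

```shell
# Print the host ports published on 0.0.0.0, one per line,
# from "docker ps"-style text read on stdin.
published_ports() {
  grep -oE '0\.0\.0\.0:[0-9]+->[0-9]+/tcp' |
    sed 's/^0\.0\.0\.0:\([0-9]*\)->.*/\1/'
}

# Usage on an ECE host (assumes the container name contains "zookeeper"):
#   docker ps --format '{{.Names}} {{.Ports}}' | grep -i zookeeper | published_ports
```

Any port printed here is reachable through the host IP, which is what makes the queries in this post possible.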

For example, say one of your ZooKeeper roles exposes port "2192". If you would like to output a list of variables that could be used for monitoring the health of the cluster, you could query the underlying host that runs the ZooKeeper role as follows:
$ echo mntr | nc ecedc1h1.lixu.ca 2192

zk_version  3.4.0
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 70
zk_packets_sent 69
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count   4
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    27
zk_followers    4                   - only exposed by the Leader
zk_synced_followers 4               - only exposed by the Leader
zk_pending_syncs    0               - only exposed by the Leader
zk_open_file_descriptor_count 23    - only available on Unix platforms
zk_max_file_descriptor_count 1024   - only available on Unix platforms
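
Once you have this output, pulling out a single value is a one-liner. The sketch below reads "mntr" output on stdin; "zk_metric" and "zk_followers_in_sync" are hypothetical helper names, and the sync check is only meaningful on the leader, since followers do not report those keys:

```shell
# Print the value of the metric named $1; exit non-zero if it is absent.
# Only the first whitespace-separated token of the value is printed.
zk_metric() {
  awk -v key="$1" '$1 == key { print $2; found = 1 } END { exit (found ? 0 : 1) }'
}

# Succeed only when the leader reports every follower as synced.
# Fails when zk_followers is missing (for example on a follower).
zk_followers_in_sync() {
  awk '$1 == "zk_followers"        { f = $2 }
       $1 == "zk_synced_followers" { s = $2 }
       END { exit ((f != "" && f == s) ? 0 : 1) }'
}

# Usage:
#   echo mntr | nc ecedc1h1.lixu.ca 2192 | zk_metric zk_server_state
#   echo mntr | nc ecedc1h1.lixu.ca 2192 | zk_followers_in_sync || echo "followers lagging"
```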

Then, if you really want to get fancy, you could periodically send the output to a monitoring and alerting system like "Datadog" (I might write another post about this). But for a quick information query, this is a nice and easy way.
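
As a rough idea of what periodic collection could look like, the sketch below turns the numeric "mntr" values into StatsD-style gauge lines ("name:value|g") and polls one server in a loop. The "zookeeper." prefix, the function names, the 5-second "nc" timeout, and the 60-second default interval are all assumptions, and how the lines reach your monitoring system depends on the agent you run:

```shell
# Convert numeric mntr lines on stdin into StatsD-style gauges.
# Non-numeric values (zk_version, zk_server_state) are skipped.
mntr_to_gauges() {
  awk '$2 ~ /^[0-9]+(\.[0-9]+)?$/ { printf "zookeeper.%s:%s|g\n", $1, $2 }'
}

# Poll host $1, port $2 forever and print gauges.
# $3 is the interval in seconds (defaults to 60).
poll_mntr() {
  while true; do
    echo mntr | nc -w 5 "$1" "$2" | mntr_to_gauges
    sleep "${3:-60}"
  done
}

# Usage:
#   poll_mntr ecedc1h1.lixu.ca 2192 60
```

The loop only prints to stdout, so you can pipe it into whatever forwarder your setup uses.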

Our ZooKeeper version is v3.5.3. Here are a few useful commands:

  • conf: Print details about serving configuration (not in the whitelist).
  • cons: List full connection/session details for all clients connected to this server. Includes information on numbers of packets received/sent, session id, operation latencies, last operation performed, etc. (not in the whitelist).
  • dump: Lists the outstanding sessions and ephemeral nodes. This only works on the leader (not in the whitelist).
  • envi: Print details about serving environment (not in the whitelist).
  • ruok: Tests if server is running in a non-error state. The server will respond with imok if it is running. Otherwise it will not respond at all. A response of "imok" does not necessarily indicate that the server has joined the quorum, just that the server process is active and bound to the specified client port. Use "stat" for details on state with respect to quorum and client connection information.
  • srvr: Lists full details for the server.
  • stat: Lists brief details for the server and connected clients.
  • wchs: Lists brief information on watches for the server (not in the whitelist).
  • wchc: Lists detailed information on watches for the server, by session. This outputs a list of sessions (connections) with associated watches (paths). Note that, depending on the number of watches, this operation may be expensive (i.e., impact server performance), so use it carefully (not in the whitelist).
  • wchp: Lists detailed information on watches for the server, by path. This outputs a list of paths (znodes) with associated sessions. Note that, depending on the number of watches, this operation may be expensive (i.e., impact server performance), so use it carefully.
  • mntr: Outputs a list of variables that could be used for monitoring the health of the cluster.
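
To check which of these commands your own server answers, you can send each one and classify the reply. This is a sketch under assumptions: "zk_4lw" and "zk_classify_reply" are hypothetical helper names, and the classifier assumes a blocked command replies with a message containing "not in the whitelist", so verify that wording against your server:

```shell
# Send one four-letter command ($3) to host $1, port $2.
zk_4lw() {
  printf '%s' "$3" | nc -w 5 "$1" "$2"
}

# Classify a reply read from stdin: prints "empty", "blocked" or "ok".
zk_classify_reply() {
  reply=$(cat)
  case $reply in
    '') echo empty ;;
    *'not in the whitelist'*) echo blocked ;;
    *) echo ok ;;
  esac
}

# Usage:
#   for c in ruok srvr stat mntr conf; do
#     printf '%s: ' "$c"
#     zk_4lw ecedc1h1.lixu.ca 2192 "$c" | zk_classify_reply
#   done
```

An "empty" result for "ruok" matches the behavior described above: a server that is not running properly does not respond at all.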
