
Monitoring Dockerized Kafka Using TIBCO Hawk JMX Plug-in - Community Edition

Last updated: 4:32pm Aug 19, 2020

TIBCO Hawk® is a sophisticated, industry-leading tool for monitoring and managing distributed applications and systems throughout the enterprise. With Hawk, system administrators can monitor application parameters, behavior, and loading activities for all nodes in a local or wide-area network, and take action when predefined conditions occur. In many cases, runtime failures or slowdowns can be repaired automatically within seconds of their discovery, reducing unscheduled outages and slowdowns of critical business systems.


This article describes how you can use TIBCO Hawk® Container Edition and TIBCO Hawk® Plug-in for JMX - Community Edition to monitor your dockerized Kafka deployments.

Introduction to Kafka

Apache Kafka is an open-source, publish-subscribe based messaging system. It is used for building real-time streaming data pipelines and streaming applications. In short, Kafka works as follows:

  • Kafka runs as a cluster on one or more servers that can span multiple datacenters. These Kafka servers are also known as brokers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Kafka topics are divided into distributed partitions for load balancing and replication.
  • Kafka producers send records to topics.
  • Kafka consumers read the data published to a particular topic.
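The partition assignment described above can be sketched in a few lines. This is a toy stand-in, not Kafka's actual partitioner (the real default uses a murmur2 hash), but it shows the idea: records with the same key always land in the same partition, which is how Kafka preserves per-key ordering.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner: hash the
    record key and map it onto one of the topic's partitions."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition.
assert partition_for(b"order-42", 3) == partition_for(b"order-42", 3)
assert 0 <= partition_for(b"order-42", 3) < 3
```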

Monitoring Dockerized Kafka Using Hawk JMX Plug-in

Kafka exposes many performance metrics via JMX. We can use the Hawk JMX plug-in to convert these metrics into Hawk microagents and extend Hawk's monitoring and management capabilities to the Kafka world. The metrics exposed by Kafka cover the Kafka broker, producers, and consumers. Details about these metrics can be found in the official Kafka documentation: http://kafka.apache.org/documentation/#monitoring

Apache Kafka Docker distribution used in this article: https://hub.docker.com/r/wurstmeister/kafka/
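For the plug-in to connect, the Kafka container must expose a JMX port. A docker-compose sketch for the image above is shown below; the `JMX_PORT` and `KAFKA_JMX_OPTS` environment-variable names are conventions of this particular image and should be verified against its documentation, and the port and hostname values are illustrative assumptions.

```yaml
# docker-compose fragment (illustrative; verify variable names for your image)
kafka:
  image: wurstmeister/kafka
  ports:
    - "9092:9092"   # Kafka listener
    - "9999:9999"   # JMX
  environment:
    JMX_PORT: "9999"
    KAFKA_JMX_OPTS: >-
      -Dcom.sun.management.jmxremote
      -Dcom.sun.management.jmxremote.authenticate=false
      -Dcom.sun.management.jmxremote.ssl=false
      -Dcom.sun.management.jmxremote.rmi.port=9999
      -Djava.rmi.server.hostname=kafka
```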

Prerequisites

Adding Hawk JMX Plug-in in the hkce_agent container

To use the Hawk JMX Plug-in, we need to add it to the hkce_agent container. The steps for adding any custom plug-in to hkce_agent are available at https://docs.tibco.com/pub/hkce/1.0.0/doc/html/GUID-F12317DD-CC65-42FD-9F8D-A42B5DD14F5D.html

To add the Hawk JMX Plug-in to the hkce_agent container, follow these steps:

  • Download the Hawk JMX Plug-in community edition from here.
  • Extract the zip file and copy all the files (including hawkjmxhma.jar, JMXPluginConfig.xml, and JMXServiceMA.hma) into the folder <TEMP_DIRECTORY>/tibco.home/hkce/1.0/plugin, where <TEMP_DIRECTORY> is the location where you extracted the Hawk Container Edition software package.
  • Edit the JMXPluginConfig.xml file and set the JMXServiceURL parameter to the JMX endpoint of your Kafka container. Since Kafka provides a large number of MBeans, you can use the MBeanFilter parameter to select which MBeans to expose as Hawk microagents.
  • Build the hkce_agent Docker image, following the documentation steps here.
  • Run all Hawk Container Edition containers in standalone mode using docker-compose.
  • If the Hawk console container is running, you can access it at http://<Console_host_IP>:<Host_port>/HawkConsole. There you should see the Kafka MBeans available as Hawk microagents, and you can start creating rulebases for monitoring Kafka.
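The configuration edit in the steps above might look like the fragment below. The element layout is sketched from the parameter names the article mentions (JMXServiceURL, MBeanFilter, MBeanServer, MBeanServerList); the host, port, and filter values are illustrative assumptions, so consult the JMXPluginConfig.xml shipped with the plug-in for the exact schema.

```xml
<!-- JMXPluginConfig.xml fragment (illustrative values only) -->
<MBeanServerList>
  <MBeanServer>
    <!-- JMX endpoint of the dockerized Kafka broker; "kafka-broker"
         and port 9999 are placeholders for your environment -->
    <JMXServiceURL>service:jmx:rmi:///jndi/rmi://kafka-broker:9999/jmxrmi</JMXServiceURL>
    <!-- Expose only broker-side MBeans as Hawk microagents -->
    <MBeanFilter>kafka.server:*</MBeanFilter>
  </MBeanServer>
</MBeanServerList>
```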

The JMXPluginConfig.xml provided with the Hawk JMX Plug-in community edition contains a configuration for only one MBeanServer. Since Kafka has multiple components (broker, producer, consumer) that expose JMX metrics, you can use the same JMXPluginConfig.xml file to monitor all of them by adding multiple <MBeanServer> configurations under <MBeanServerList>.
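For example, to monitor the broker and a consumer application from the same agent, the file might contain two entries. The endpoints and filters below are again illustrative assumptions, not values from the shipped configuration.

```xml
<MBeanServerList>
  <!-- Kafka broker -->
  <MBeanServer>
    <JMXServiceURL>service:jmx:rmi:///jndi/rmi://kafka-broker:9999/jmxrmi</JMXServiceURL>
    <MBeanFilter>kafka.server:*</MBeanFilter>
  </MBeanServer>
  <!-- Kafka consumer application -->
  <MBeanServer>
    <JMXServiceURL>service:jmx:rmi:///jndi/rmi://consumer-app:9998/jmxrmi</JMXServiceURL>
    <MBeanFilter>kafka.consumer:*</MBeanFilter>
  </MBeanServer>
</MBeanServerList>
```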

Kafka Metrics to Monitor

Kafka provides a large number of MBeans, but not all of them are relevant from a monitoring perspective. The tables below list a few sample Kafka MBeans, their equivalent Hawk microagent methods, and sample Hawk rule test conditions for generating alerts.

Kafka broker metrics:
Metric | MBean (= Hawk microagent) | Hawk method | Sample Hawk rule test
Number of offline partitions | kafka.controller:type=KafkaController,name=OfflinePartitionsCount | _getValue | Value > 0
Average fraction of time the network processors are idle | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | _getValue | Value < 0.3 (as suggested by the official Kafka docs)
FetchConsumer request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer | _getMean |
FetchFollower request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower | _getMean |
Produce request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce | _getMean |
Total time spent for FetchConsumer requests | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer | _getMean |
Total time spent for FetchFollower requests | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower | _getMean |
Total time spent for Produce requests | kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | _getMean |
Count of under-replicated partitions | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | _getValue | Value > 0
Count of under-min-ISR partitions, i.e. (|ISR| < min.insync.replicas) | kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | _getValue | Value > 0
Offline replica count | kafka.server:type=ReplicaManager,name=OfflineReplicaCount | _getValue | Value > 0
ISR shrink rate | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | _getOneMinuteRate | Expected value is 0; the ISR for some partitions shrinks when a broker goes down and expands again once its replicas are fully caught up
ISR expansion rate | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | _getOneMinuteRate | See above
Number of incoming messages per second | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | _getCount (total messages), _getOneMinuteRate (messages/minute) |
Byte-in rate from clients | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | _getOneMinuteRate |
Byte-out rate to clients | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | _getOneMinuteRate |
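The sample rule tests in the table above amount to simple threshold checks. The predicates below are a toy sketch in plain Python, not Hawk rulebase syntax, with thresholds taken from the table.

```python
# Toy threshold checks mirroring the sample Hawk rule tests above.
# These are plain Python predicates, not Hawk rulebase syntax.

def offline_partitions_alert(value: int) -> bool:
    # OfflinePartitionsCount: any offline partition warrants an alert.
    return value > 0

def network_idle_alert(avg_idle: float) -> bool:
    # NetworkProcessorAvgIdlePercent: the official Kafka docs suggest
    # alerting when the idle fraction drops below 0.3.
    return avg_idle < 0.3

def under_replicated_alert(count: int) -> bool:
    # UnderReplicatedPartitions: expected to be 0 in steady state.
    return count > 0
```

In a real deployment these conditions would be expressed as rulebase tests against the `_getValue` results of the corresponding microagent methods.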
Kafka consumer metrics:
Metric | MBean (= Hawk microagent) | Hawk method | Sample Hawk rule test
Average time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id={client-id} | _getcommit-latency-avg |
Average number of heartbeats per second | kafka.consumer:type=consumer-coordinator-metrics,client-id={client-id} | _getheartbeat-rate |
Number of seconds since the last coordinator heartbeat | kafka.consumer:type=consumer-coordinator-metrics,client-id={client-id} | _getlast-heartbeat-seconds-ago |
Number of messages the consumer lags behind the producer (published by the consumer, not the broker) | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} | _getrecords-lag-max | records-lag-max > some_predefined_value
Average number of records consumed per second for a topic | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id},topic={topic} | _getrecords-consumed-rate |
Max request latency between the consumer and a broker node | kafka.consumer:type=consumer-node-metrics,client-id={client-id},node-id={node-id} | _getrequest-latency-max |

Note:

  1. In the above table, {client-id} represents the id of the Kafka consumer, so there will be a corresponding MBean and microagent for each consumer in your application.
  2. Similarly, {topic} represents the Kafka topic your consumer is consuming, so there will be a corresponding MBean and microagent for each consumer-topic pair.
  3. Likewise, {node-id} represents the broker node id, so there will be an MBean and microagent for each consumer-node pair.
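The records-lag-max rule test above compares the lag against `some_predefined_value`, which the article leaves unspecified. A minimal sketch, where the 1000-record threshold is an arbitrary illustrative choice to be tuned to your application's latency tolerance:

```python
# Toy check for the records-lag-max sample rule test above.
# LAG_THRESHOLD stands in for "some_predefined_value" in the table;
# 1000 records is an arbitrary illustrative value.
LAG_THRESHOLD = 1000

def consumer_lag_alert(records_lag_max: float,
                       threshold: float = LAG_THRESHOLD) -> bool:
    # Fires when the consumer has fallen too far behind the producer.
    return records_lag_max > threshold
```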

Kafka producer metrics:
Metric | MBean (= Hawk microagent) | Hawk method | Sample Hawk rule test
Maximum time in ms a request was throttled by a broker | kafka.producer:type=producer-metrics,client-id={client-id} | _getproduce-throttle-time-max |
Total number of connections closed | kafka.producer:type=producer-metrics,client-id={client-id} | _getconnection-close-total |
Number of network operations (reads or writes) on all connections per second | kafka.producer:type=producer-metrics,client-id={client-id} | _getnetwork-io-rate |
Average request latency in ms | kafka.producer:type=producer-node-metrics,client-id={client-id},node-id={node-id} | _getrequest-latency-avg |
Average per-second number of retried record sends for a topic | kafka.producer:type=producer-topic-metrics,client-id={client-id},topic={topic} | _getrecord-retry-rate | Alert if the retry rate is high
Average per-second number of record sends that resulted in errors for a topic | kafka.producer:type=producer-topic-metrics,client-id={client-id},topic={topic} | _getrecord-error-rate | Alert if the error rate is high

Note:

  1. In the above table, {client-id} represents the id of the Kafka producer, so there will be a corresponding MBean and microagent for each producer in your application.
  2. Similarly, {topic} represents the Kafka topic your producer is sending data to, so there will be a corresponding MBean and microagent for each producer-topic pair.
  3. Likewise, {node-id} represents the broker node id, so there will be an MBean and microagent for each producer-node pair.