Recently we got stuck trying to find a consumer of a Kafka topic that was causing problems and turned out to be difficult to identify.
We started noticing that the Kafka consumer group would occasionally re-balance, and our application's consumer would suddenly stop receiving messages. Before we could run any diagnostics, the group would re-balance again and our consumer would start receiving messages again, but we would have lost some messages in between.
- We tried to identify the consumer's IP address using kafka-consumer-groups.sh: For a given group ID, this command shows the IP address of the consumer for a given topic. But as mentioned above, our group would re-balance before we noticed, so we could not tell who the consumer was during that intermittent period.
- Changing client-side Kafka logging to debug: We enabled Kafka logging on the client side for the package org.apache.kafka.clients.consumer at debug level. This prints a lot of information, such as heartbeats and offset commits, but what we were interested in were the log lines about the re-balance of our consumer group. They showed us that after a re-balance our consumer lost control of the topic subscription. This confirmed that another, mysterious consumer was snatching that control. But we still didn't know who it was or where it was running.
- Changing the Kafka server logging to debug: So we went to the Kafka server side. We stopped all the Kafka servers in the cluster except one and changed the log level to debug in log4j.properties in the config folder.
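The exact lines we changed are not reproduced here. As a minimal sketch, assuming the log4j 1.x properties format that Kafka brokers shipped with at the time, a change along these lines should be enough to get connection messages from the kafka.network.Processor logger. The choice between the broad and the narrow setting is an assumption, not a record of what we actually edited.

```properties
# config/log4j.properties (log4j 1.x syntax; assumed, verify against your broker version)

# Broad option: everything under the "kafka" logger hierarchy at DEBUG.
# Very verbose, so only keep it on while investigating.
log4j.logger.kafka=DEBUG

# Narrower alternative: only the network processor that logs new connections.
# log4j.logger.kafka.network.Processor=DEBUG
```

The broker typically reads this file at startup, so it needs a restart for the change to take effect.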
This started printing everything, so we dug into the Kafka server's server.log file and voilà, we found it! Right after the Kafka server started, we found a couple of lines like these:
[2019-07-18 06:01:31,773] DEBUG Processor 0 listening to new connection from /10.50.1.2:40404 (kafka.network.Processor)
[2019-07-18 06:01:35,476] DEBUG Processor 1 listening to new connection from /10.55.3.49:60514 (kafka.network.Processor)
We recognized one of the addresses but not the other, so we looked into it, and there it was: a process trying to connect, causing the re-balance and stealing our messages.
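On a busy broker, picking the unfamiliar address out of server.log by eye gets tedious. Here is a minimal Python sketch of the same check, assuming the DEBUG line format quoted above and IPv4 addresses. The function name `unknown_clients` and the `known_ips` set are hypothetical helpers for illustration, not part of Kafka.

```python
import re

# Matches the DEBUG connection lines quoted in this post, e.g.
# [2019-07-18 06:01:31,773] DEBUG Processor 0 listening to new connection from /10.50.1.2:40404 (kafka.network.Processor)
# Assumption: IPv4 addresses only (an IPv6 address contains colons and would not match).
CONNECTION_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] DEBUG Processor \d+ listening to new connection "
    r"from /(?P<ip>[^:\s]+):(?P<port>\d+)"
)


def unknown_clients(log_lines, known_ips):
    """Return {ip: first_seen_timestamp} for connecting IPs that are not in known_ips."""
    first_seen = {}
    for line in log_lines:
        # finditer tolerates several log entries ending up on one physical line
        for match in CONNECTION_RE.finditer(line):
            ip = match.group("ip")
            if ip not in known_ips and ip not in first_seen:
                first_seen[ip] = match.group("ts")
    return first_seen


if __name__ == "__main__":
    sample = [
        "[2019-07-18 06:01:31,773] DEBUG Processor 0 listening to new connection "
        "from /10.50.1.2:40404 (kafka.network.Processor)",
        "[2019-07-18 06:01:35,476] DEBUG Processor 1 listening to new connection "
        "from /10.55.3.49:60514 (kafka.network.Processor)",
    ]
    # 10.50.1.2 stands in for the host we already knew about
    print(unknown_clients(sample, known_ips={"10.50.1.2"}))
```

To run it against a real log, pass an open file instead of the sample list, for example `unknown_clients(open("server.log"), known_ips)`. Every client that connects to the broker shows up here, producers included, so the result is a list of candidates to check rather than proof of a rogue consumer.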
So what next? We killed it, and all's well!!