Root Cause & Remediation
Common root causes: a consumer pod stuck in a crash loop, a poison pill message that stops a partition's offset from advancing, too few consumer instances during a traffic spike, or a downstream database bottleneck that slows the rate at which the consumer commits offsets.
Remediation steps
1. Check the consumer group's committed offsets and per-partition lag using kafka-consumer-groups.sh (--describe --group <group>) or your observability dashboard.
2. Identify poison pill messages blocking a partition and route them to a Dead Letter Queue (DLQ) so the consumer can move past them.
3. Scale the consumer deployment up, to at most the number of partitions in the topic. Consumers beyond that count in the same group sit idle.
4. Check the downstream database or API that the consumer writes to for bottlenecks or rate limits.
5. If the business use case can tolerate it, reset the consumer group's offsets forward to skip the backlog, accepting that the skipped messages are lost.
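
Steps 1 to 3 can be sketched in code. The following is a minimal illustration, not a definitive implementation. It assumes you have already collected two snapshots of per-partition offsets (for example by parsing the CURRENT-OFFSET and LOG-END-OFFSET columns of kafka-consumer-groups.sh output). The names PartitionSample, stalled_partitions and recommended_replicas are hypothetical, as are the throughput and drain-time parameters. The replica estimate ignores messages that keep arriving while the backlog drains, so treat it as a lower bound.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class PartitionSample:
    """Offsets for one partition at one point in time (hypothetical shape)."""
    partition: int
    committed_offset: int  # CURRENT-OFFSET for the consumer group
    log_end_offset: int    # LOG-END-OFFSET of the partition

    @property
    def lag(self) -> int:
        # Number of messages not yet committed by the group; never negative.
        return max(self.log_end_offset - self.committed_offset, 0)


def stalled_partitions(earlier, later):
    """Partitions whose committed offset did not advance while lag remained.

    A stalled partition is a candidate for a poison pill (step 2), although
    a crashed or slow consumer produces the same symptom.
    """
    before = {s.partition: s for s in earlier}
    return [
        s.partition
        for s in later
        if s.partition in before
        and s.committed_offset == before[s.partition].committed_offset
        and s.lag > 0
    ]


def recommended_replicas(samples, per_consumer_rate, drain_seconds):
    """Consumers needed to drain the current lag within drain_seconds.

    The result is capped at the partition count (step 3), because extra
    consumers in the same group would receive no partitions.
    """
    total_lag = sum(s.lag for s in samples)
    needed = ceil(total_lag / (per_consumer_rate * drain_seconds))
    return min(max(needed, 1), len(samples))


# Example with made-up offsets for a three-partition topic.
earlier = [PartitionSample(0, 100, 500), PartitionSample(1, 200, 900), PartitionSample(2, 50, 50)]
later = [PartitionSample(0, 100, 800), PartitionSample(1, 650, 1200), PartitionSample(2, 60, 60)]

print(stalled_partitions(earlier, later))  # [0]
print(recommended_replicas(later, per_consumer_rate=20.0, drain_seconds=60))  # 2
```

In this example partition 0 has not committed anything between the two snapshots even though its lag grew, so it is the first place to look for a poison pill. Partitions 1 and 2 are making progress and are only affected by throughput.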