KAFKA-LAG | Messaging

Kafka Consumer Lag / Event Backlog

A Kafka consumer group falls critically behind the producer rate, causing stale data in downstream read models, delayed notifications, or stalled asynchronous transaction processing.

Root Cause & Remediation

Typical causes are a consumer pod stuck in a crash loop, a poison-pill message that halts the partition offset, too few consumer instances during a traffic spike, or a downstream database bottleneck that slows the consumer's commit rate.

Remediation steps

  1. Check consumer group offsets using kafka-consumer-groups.sh or your observability dashboard.
  2. Identify poison-pill messages blocking the partition and route them to a Dead Letter Queue (DLQ).
  3. Scale up the consumer deployment, to at most the number of partitions in the topic; consumers beyond that count sit idle.
  4. Check the downstream database or API that the consumer writes to for bottlenecks or rate limits.
  5. If acceptable for the business use case, reset the consumer group's offset to skip the backlog, accepting that the skipped messages are lost.
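
The first three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production consumer: the function names, the `per_consumer_capacity` figure (messages one consumer can clear in your target recovery window), and the `handler` / `send_to_dlq` callables are all hypothetical, and the offsets are passed in as plain dictionaries rather than fetched from a broker.

```python
from typing import Any, Callable, Dict, Iterable


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log end offset minus committed offset, floored at 0."""
    # Assumption: a partition with no committed offset counts as unread from 0.
    # Real behaviour depends on the consumer's auto.offset.reset setting.
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def recommended_replicas(total_lag: int, per_consumer_capacity: int, partitions: int) -> int:
    """Consumers needed to drain total_lag, capped at the partition count."""
    needed = -(-total_lag // per_consumer_capacity)  # ceiling division
    # Extra consumers beyond the partition count would receive no partitions.
    return max(1, min(needed, partitions))


def consume_with_dlq(
    messages: Iterable[Any],
    handler: Callable[[Any], None],
    send_to_dlq: Callable[[Any, str], None],
    max_attempts: int = 3,
) -> int:
    """Process messages in order; park any message that keeps failing in the DLQ."""
    parked = 0
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                break  # processed, move on to the next message
            except Exception as exc:
                if attempt == max_attempts:
                    # Poison pill: record it and continue instead of blocking.
                    send_to_dlq(msg, repr(exc))
                    parked += 1
    return parked
```

For example, end offsets `{0: 1500, 1: 900, 2: 400}` with committed offsets `{0: 1000, 1: 900}` give a lag of `{0: 500, 1: 0, 2: 400}`. At an assumed capacity of 200 messages per consumer that would call for five consumers, which the cap reduces to three because the topic has three partitions.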

DORA Risk Matrix

Typical classification: Context-dependent
Likelihood: Medium
Blast radius: Affects all asynchronous operations that depend on the topic. Real-time dashboards, async order processing, and email notifications degrade.
CIF impact: Stale account balances, delayed trade confirmations, or stalled onboarding workflows.
Analyst notes: Often MINOR unless the lag affects a Critical or Important Function (CIF), such as fraud screening or live balance updates, for long enough to breach the 2-hour downtime threshold.
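
The analyst note can be read as a simple first-pass rule. The sketch below encodes only that rule and is a deliberate simplification: the function name and the `ESCALATE` label are hypothetical, treating "breach" as strictly more than two hours is an assumption, and a real materiality assessment weighs further criteria that are not modelled here.

```python
def classify_lag_incident(
    affects_cif: bool,
    downtime_hours: float,
    threshold_hours: float = 2.0,
) -> str:
    """First-pass triage of a consumer-lag incident (illustrative only)."""
    # Assumption: the threshold is breached only when downtime exceeds it.
    if affects_cif and downtime_hours > threshold_hours:
        return "ESCALATE"  # hypothetical label: needs a full classification review
    return "MINOR"
```

Under these assumptions, a fraud-screening backlog that stalls for 2.5 hours returns `ESCALATE`, while a 10-hour lag on a topic that feeds no CIF stays `MINOR`.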
