Kafka & Streaming Reliability

Kafka reliability (the stuff that matters in prod)

  • Topic design: partition count, replication factor, retention, and log compaction for topics where only the latest value per key matters.
  • Consumers: stable group membership, offset management (commit after processing), idempotent handlers, bounded retries, and dead-letter queue (DLQ) patterns.
  • Backpressure: protect downstream systems; rate limit and batch responsibly.
  • Observability: lag (per group/partition), rebalance rate, produce/consume errors, throughput.
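Sketch of the consumer bullet (idempotency, retries, DLQ), with no Kafka client involved. Everything here is hypothetical and for illustration only: Record, Processor, the in-memory seen set, and the dead_letters list standing in for a DLQ topic. A real consumer would keep the dedupe state somewhere durable and commit the offset only after process() returns.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Record:
    key: str
    value: str
    partition: int
    offset: int


@dataclass
class Processor:
    handler: Callable[[Record], None]
    max_attempts: int = 3
    # (partition, offset) pairs already handled; in-memory only in this sketch.
    seen: Set[Tuple[int, int]] = field(default_factory=set)
    # Stand-in for producing to a DLQ topic.
    dead_letters: List[Record] = field(default_factory=list)

    def process(self, record: Record) -> str:
        ident = (record.partition, record.offset)
        if ident in self.seen:
            # Redelivery (e.g. after a rebalance or restart): do nothing twice.
            return "skipped"
        for _ in range(self.max_attempts):
            try:
                self.handler(record)
                self.seen.add(ident)
                return "ok"
            except Exception:
                continue  # bounded retry; a real consumer would back off here
        # Retries exhausted: park the record so the partition is not blocked.
        self.dead_letters.append(record)
        self.seen.add(ident)
        return "dead-lettered"
```

Usage: build Processor(handler=my_handler), call process(record) for each polled record, and commit the offset once it returns, whatever the result. Dead-lettering after a bounded number of attempts keeps one poison message from stalling the whole partition.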

Debugging playbook (quick checklist)

  • Lag spike: input surge, consumer slowdown, rebalance, or hot partition?
  • Check: consumer errors, commit rate, rebalance count, broker health, ISR, network.
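  • Fix: scale consumers (up to the partition count), tune max.poll.records and max.poll.interval.ms, add partitions carefully (this changes the key-to-partition mapping, so per-key ordering can break), isolate hot keys.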
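Sketch of the first triage step for a lag spike: compute lag per partition and check whether one partition holds most of it. It assumes the log end offsets and committed offsets have already been fetched (from the consumer client or an admin tool) into plain dicts. The function names and the 50% threshold are illustrative assumptions, not Kafka settings.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset minus committed offset, floored at 0.

    Assumption: a partition with no committed offset is counted from offset 0.
    """
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def hot_partitions(lag: Dict[int, int], share: float = 0.5) -> List[int]:
    """Partitions that hold more than `share` of the group's total lag."""
    total = sum(lag.values())
    if total == 0:
        return []
    return sorted(p for p, v in lag.items() if v / total > share)


end_offsets = {0: 1000, 1: 1000, 2: 5000}
committed = {0: 990, 1: 980, 2: 1000}
lag = partition_lag(end_offsets, committed)
skewed = hot_partitions(lag)
```

If lag is spread evenly, suspect an input surge or a general consumer slowdown. If one partition dominates, suspect a hot key or a stuck consumer instance on that partition.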

Links