Session Type
Breakout Session
Name
Kafka Connection Chaos: Surviving the Storm
Date
Wednesday, May 21, 2025
Time
2:00 PM - 2:45 PM
Location Name
Breakout Room 7
Description
It is 9 AM, and the support team begins maintenance to renew the Kafka brokers' certificates. By 9:30 AM, half of the cluster has been updated correctly, but the liveness probe metric seems unstable. We check connectivity, and everything looks fine. Our monitoring stack shows that it can consume from and produce to all brokers. Connection counts are a bit higher than usual but still within limits. At 9:40 AM, some teams start complaining that they can neither consume nor produce. What is happening? Suddenly, we discover that the acceptor metric indicates brokers are blocking 80% of connections. What is an acceptor, and why is it blocking our connections? The scenario above describes an incident in which our Kafka platform experienced a connection storm that led to significant degradation. The event highlighted the crucial need for effective connection management and exposed gaps in our understanding of how Kafka handles connections, especially new ones. In this talk, we will share our journey and insights with platform teams that maintain Kafka. You’ll learn how Kafka manages connections on Linux servers and the challenges you might encounter. We will dive into the metrics and mechanisms Kafka offers to detect and protect against connection storms. Last but not least, we’ll share tips from our experience to help you avoid the mistakes we made.
Javier Hortal, Rafael García Ortega
Level
Intermediate
Target Audience
Architect, Operator/Administrator, Executive (Technical)
Industry
Manufacturing, Retail/E-Commerce
Tags
Tales from the trenches, Apache Kafka, Operations, Systems