Managing large-scale Kafka clusters is both a technical challenge and an art. At Trendyol, our Data Streaming team operates Kafka as the backbone of a vast event-driven ecosystem and is responsible for keeping it stable and the client experience seamless. However, we faced recurring issues during broker restarts: applications hit connectivity errors caused by misconfigured topics and improper bootstrap server configurations. To address this, we deployed Confluent Stretch Kafka across multiple data centers, so that leader elections happen automatically without service disruption. We also enforced topic creation and alter policies and built a custom Prometheus exporter that detects misconfigured topics in real time, so that we can notify owners and take corrective action proactively. With rigorous alerting and enforcement through our Internal Development Platform (IDP), we have eliminated disruptions during broker restarts, which in turn makes cluster upgrades and chaos testing smooth. This session offers practical insights into architecting resilient Kafka deployments, enforcing best practices, and ensuring high availability in a production environment that serves thousands of clients.
Attendees will learn:
- How multi-DC Kafka clusters ensure client continuity
- The impact of misconfigured replication factors and how to prevent such misconfigurations
- How real-time monitoring and alerts reduce operational risks
- Practical strategies to enforce resilient topic configurations
Mehmetcan Güleşçi, DSM GRUP DANIŞMANLIK İLETİŞİM VE SATIŞ TİCARET ANONİM ŞİRKETİ
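
The abstract does not describe how the policies or the exporter decide that a topic is misconfigured. As an illustration only, the kind of rule involved can be sketched as a small check on a topic's replication factor and min.insync.replicas. Everything below is an assumption made for this sketch rather than Trendyol's implementation: the TopicConfig type, the find_violations function, and the thresholds (replication factor of at least 3, min.insync.replicas of at least 2).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopicConfig:
    """Hypothetical snapshot of the topic settings this check looks at."""
    name: str
    replication_factor: int
    min_insync_replicas: int


def find_violations(topic: TopicConfig, min_rf: int = 3, min_isr: int = 2) -> List[str]:
    """Return human-readable reasons why a topic may not survive a broker restart.

    The default thresholds are illustrative assumptions, not values from the talk.
    """
    violations = []
    # With too few replicas, restarting a broker can take partitions offline.
    if topic.replication_factor < min_rf:
        violations.append(
            f"replication factor {topic.replication_factor} is below {min_rf}"
        )
    # A low min.insync.replicas weakens durability for acks=all producers.
    if topic.min_insync_replicas < min_isr:
        violations.append(
            f"min.insync.replicas {topic.min_insync_replicas} is below {min_isr}"
        )
    # If every replica must be in sync, losing one broker blocks acks=all writes.
    if topic.min_insync_replicas >= topic.replication_factor:
        violations.append(
            "min.insync.replicas must be lower than the replication factor"
        )
    return violations


if __name__ == "__main__":
    for topic in (
        TopicConfig("orders", 3, 2),
        TopicConfig("legacy-events", 1, 1),
        TopicConfig("strict", 3, 3),
    ):
        print(topic.name, find_violations(topic))
```

A check like this could run in two places: at topic creation or alteration time to reject the request, and periodically against existing topics, with the number of violations exposed as a metric for alerting.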

