Within Uber, we operate numerous Kafka clusters comprising thousands of nodes, tailored to different use cases. These clusters collectively handle a few trillion messages daily across thousands of topics, ingesting multiple petabytes of data. Several clusters are exceptionally large, exceeding 150 nodes. Kafka plays a crucial role in inter-service communication, database changelog transport, data lake ingestion, and more, and it houses business-critical data such as billing and payment information. Kafka is a tier-0 technology at Uber, guaranteeing 99.99% data durability, and its availability is tied to the health of the underlying nodes.

However, these nodes are aging, leading to increasing disk failures and the need for replacements, with attendant risks of offline partitions and data loss. To ensure uninterrupted operations, we needed to migrate topics and partitions to newer, high-performance SKUs. This migration introduced several challenges:

1. Preserving rack-aware distribution to maintain zone-failure resiliency during the migration.
2. Managing significant differences in disk capacity between the old SKU (legacy) nodes and the new SKU (H20A) nodes.
3. Adhering to disk usage thresholds on the new SKU nodes to avoid performance degradation.
4. Balancing nodes within racks to ensure continuous resiliency and fault tolerance.
5. Handling variability in Kafka cluster configurations, especially for low-latency clusters, where introducing new replicas could increase latency.

Join us to learn how we overcame these challenges using strategies like tiered storage and cluster rebalance to successfully migrate Kafka infrastructure at Uber.
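
The rack-awareness and disk-threshold constraints above can be thought of as a validation pass over a proposed partition reassignment. The following is a minimal illustrative sketch, not Uber's actual tooling: all broker IDs, rack names, capacities, and function names are hypothetical assumptions.

```python
# Hypothetical sketch of reassignment validation. Broker/rack data,
# capacities, and the 80% threshold are illustrative assumptions.

BROKER_RACKS = {  # broker id -> rack (availability zone)
    1: "rack-a", 2: "rack-a",
    3: "rack-b", 4: "rack-b",
    5: "rack-c", 6: "rack-c",
}

# Legacy SKU nodes (1-4) have far less disk than new SKU nodes (5-6).
DISK_CAPACITY_GB = {1: 4000, 2: 4000, 3: 4000, 4: 4000, 5: 16000, 6: 16000}
DISK_USED_GB = {1: 2500, 2: 3300, 3: 2500, 4: 2800, 5: 4000, 6: 5000}
MAX_DISK_USAGE = 0.8  # usage threshold to avoid performance degradation


def violates_rack_awareness(replicas):
    """A partition's replicas should land on distinct racks."""
    racks = [BROKER_RACKS[b] for b in replicas]
    return len(set(racks)) < len(racks)


def violates_disk_threshold(broker, partition_size_gb):
    """Would placing this partition push the broker over the threshold?"""
    projected = DISK_USED_GB[broker] + partition_size_gb
    return projected / DISK_CAPACITY_GB[broker] > MAX_DISK_USAGE


def check_assignment(replicas, partition_size_gb):
    """Return all constraint violations for a proposed replica set."""
    problems = []
    if violates_rack_awareness(replicas):
        problems.append("replicas share a rack")
    for b in replicas:
        if violates_disk_threshold(b, partition_size_gb):
            problems.append(f"broker {b} would exceed disk threshold")
    return problems
```

For example, `check_assignment([1, 3, 5], 500)` returns no violations (one replica per rack, all brokers under threshold), while `check_assignment([1, 2, 5], 500)` flags both a shared rack and a near-full broker. A real migration planner would run such checks against every partition before executing a rebalance.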

