Ingesting data from Apache Kafka into Apache Iceberg presents a recurring challenge in modern ETL workflows. The conventional approach relies on connectors, but bridging two systems built on fundamentally different assumptions introduces operational hurdles: Kafka excels at real-time streaming workloads, while Iceberg is optimized for analytical storage and batch ingestion. The mismatch creates several inefficiencies:
- Batch Operations on Streaming Storage: Running batch-style reads against Kafka, a system designed for streaming, creates ingestion bottlenecks and strains brokers. Initial table hydration is a prime example: replaying a topic’s full history forces brokers to serve large volumes of uncached reads, slowing topic-to-table hydration and degrading performance for latency-sensitive workloads sharing the same brokers (see the first sketch after this list).
- Streaming Operations on Batch Storage: Applying streaming-style ingestion patterns to Iceberg generates numerous small Parquet files. These files bloat Iceberg’s metadata, degrade query planning and scan performance, and drive up the need for maintenance operations (see the second sketch after this list).
- Lack of Unified Table Maintenance: Aggressively committing small files full of updates conflicts with maintenance operations such as compaction running in the background; under Iceberg’s optimistic concurrency model, the losing commit must be retried, wasting work.
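
To make the first inefficiency concrete, here is a minimal sketch of batch-style hydration that replays a Kafka topic from offset 0. It is illustrative only, not Tableflow’s implementation; the broker address, topic name, and partition count are assumptions.

```python
# Sketch of a batch operation on streaming storage: replaying a topic's
# full history from offset 0. Old segments are unlikely to be in the
# brokers' page cache, so this scan competes with live traffic for I/O.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "iceberg-hydrator",
})

# Assign every partition explicitly at offset 0 (6 partitions assumed).
consumer.assign([TopicPartition("events", p, 0) for p in range(6)])

records = []
while True:
    msg = consumer.poll(1.0)
    if msg is None:   # simplification: treat a poll timeout as "caught up"
        break
    if msg.error():
        continue
    records.append(msg.value())  # buffer history for the initial table load

consumer.close()
```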
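The second and third inefficiencies stem from write patterns like the sketch below, which flushes tiny micro-batches straight into an Iceberg table with PyIceberg. The catalog name, table identifier, and flush threshold are hypothetical; the point is that every commit produces new small Parquet files and a new snapshot, and a background compaction job committing against the same table must retry when these appends land first.

```python
# Sketch of streaming-style ingestion into batch storage, assuming a
# PyIceberg catalog configured as "demo" and a hypothetical table
# "analytics.events" whose schema matches the incoming records.
import json
import pyarrow as pa
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

table = load_catalog("demo").load_table("analytics.events")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "iceberg-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

buffer = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    buffer.append(json.loads(msg.value()))
    if len(buffer) >= 500:  # flush tiny batches to keep latency low
        # Each append commits a new snapshot containing small Parquet
        # files; at one commit per second that is ~86,400 files per day
        # for Iceberg's metadata to track and for compaction to clean up.
        table.append(pa.Table.from_pylist(buffer))
        buffer.clear()
```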
In this talk, Alex will share insights and lessons learned from building Tableflow, a unified batch/streaming storage system that addresses all three inefficiencies. He will walk through specific solutions implemented in the Kora storage engine that mitigate these issues and make the two systems work cohesively. Attendees will leave with actionable knowledge for overcoming these operational challenges and designing scalable pipelines that get the most out of both Kafka and Iceberg.
