Started June 1, 2025
In Progress
Real-Time Clickstream Analytics
Event-driven pipeline for real-time SKU click ingestion using Kafka, Flink, and Iceberg.
flink iceberg kafka realtime minio
Overview
This project simulates and processes high-volume SKU click events using a real-time pipeline built on Kafka, Flink, and Iceberg. Queryable data lands in Iceberg tables backed by MinIO, and Grafana is used to monitor aggregates and pipeline lag.
It demonstrates:
- Sub-minute ingestion and transformation latency
- Schema evolution safety via Apache Iceberg
- Durability and reprocessing using Kafka retention (topic configuration sketched after this list)
- Clean separation between dev and deploy environments
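The reprocessing guarantee only holds if the click topic retains events long enough to replay them. A minimal sketch of the topic setup, assuming a `sku-clicks` topic, 12 partitions, replication factor 3, and a seven-day retention window (none of which are the project's confirmed values):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateClickTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Retention long enough to replay the stream after downstream fixes
            NewTopic topic = new NewTopic("sku-clicks", 12, (short) 3)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(7L * 24 * 60 * 60 * 1000),   // 7 days, illustrative
                            TopicConfig.CLEANUP_POLICY_CONFIG,
                            TopicConfig.CLEANUP_POLICY_DELETE));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```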
Architecture
- Kafka receives generated click events (up to 50K/sec)
- Flink (custom JAR job) normalizes and loads the data into Iceberg
- Iceberg is backed by MinIO using a Hadoop catalog
- Grafana queries the Iceberg tables and tracks lag metrics
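A minimal sketch of this path, assuming a JSON-encoded `sku-clicks` topic, a three-field click schema, and a Hadoop-catalog table at `s3a://warehouse/clicks/sku_clicks` on MinIO; these names are illustrative, and the `parse` stub stands in for the real normalization logic in the custom JAR job:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.types.Row;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class ClickIngestJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000); // Iceberg commits data files on each checkpoint

        // Source: raw JSON click events from Kafka (topic name is an assumption)
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("sku-clicks")
                .setGroupId("click-ingest")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> raw =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-sku-clicks");

        // Normalize: project each event into (sku, user_id, event_ts) rows.
        // The real parsing/validation lives in the custom JAR job; this is a stub.
        DataStream<Row> rows = raw
                .map(ClickIngestJob::parse)
                .returns(Types.ROW(Types.STRING, Types.STRING, Types.LONG));

        TableSchema schema = TableSchema.builder()
                .field("sku", DataTypes.STRING())
                .field("user_id", DataTypes.STRING())
                .field("event_ts", DataTypes.BIGINT())
                .build();

        // Sink: Iceberg table in a Hadoop catalog whose warehouse lives on MinIO (via s3a)
        TableLoader table = TableLoader.fromHadoopTable("s3a://warehouse/clicks/sku_clicks");
        FlinkSink.forRow(rows, schema)
                .tableLoader(table)
                .append();

        env.execute("sku-click-ingest");
    }

    private static Row parse(String json) {
        // Placeholder for parsing the generated click payload
        return Row.of("sku-123", "user-456", System.currentTimeMillis());
    }
}
```

The checkpoint interval is the main knob here: the Iceberg sink only commits data files when a checkpoint completes, so keeping it low is what makes sub-minute ingestion latency achievable.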
Status
This project is in progress: the core ingestion pipeline is stable, but the backpressure testing and auto-scaling components are still being tuned.
Challenges
- Flink SQL CLI had compatibility issues with table catalogs
- Iceberg required bootstrap hacks due to default DB assumptions (see the sketch after this list)
- MinIO ACLs and Flink versioning caused silent failures
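One concrete shape the bootstrap workaround can take, assuming a Hadoop catalog rooted at `s3a://warehouse` on MinIO and an illustrative click schema: pre-create the namespace and table so the Flink job never depends on a default database already existing.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class BootstrapClickTable {
    public static void main(String[] args) {
        // MinIO endpoint and path-style access; S3 credentials omitted here
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint", "http://minio:9000");
        conf.set("fs.s3a.path.style.access", "true");

        HadoopCatalog catalog = new HadoopCatalog(conf, "s3a://warehouse");

        Schema schema = new Schema(
                Types.NestedField.required(1, "sku", Types.StringType.get()),
                Types.NestedField.required(2, "user_id", Types.StringType.get()),
                Types.NestedField.required(3, "event_ts", Types.LongType.get()));

        TableIdentifier table = TableIdentifier.of("clicks", "sku_clicks");
        if (!catalog.tableExists(table)) {
            // Create the namespace directory and table metadata up front,
            // instead of assuming a "default" database exists
            catalog.createTable(table, schema, PartitionSpec.unpartitioned());
        }
    }
}
```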
Lessons Learned
- Always use small batch replays before enabling live ingestion (bounded-replay sketch after this list)
- Avoid SQL CLI for production workloads — favor compiled jobs
- Iceberg's flexibility comes with extra catalog config friction
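A sketch of the bounded-replay idea, reusing the same Kafka connector as the ingestion job (topic and group id are assumptions): building the source in bounded mode makes the job drain whatever is currently in the topic and then finish, so the Iceberg output can be verified before switching to the unbounded live source.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public class ReplaySource {
    // Bounded replay: consume from the earliest offset up to the end offsets
    // captured at startup, then let the job finish instead of running forever.
    public static KafkaSource<String> bounded(String brokers, String topic) {
        return KafkaSource.<String>builder()
                .setBootstrapServers(brokers)
                .setTopics(topic)
                .setGroupId("click-replay-validation")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
```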
Related Services
- Kafka to Flink ingestion design
- Iceberg migration and tuning
- Schema evolution strategy
- Real-time monitoring pipelines
- AWS-compatible S3 object modeling
Recent Activity
Completed Iceberg table migration with zero downtime using a dual-write pattern (illustrated below).
Optimized Flink job performance, reducing processing latency by 40%.
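A rough illustration of how that dual-write phase can be wired in Flink, with the table paths and `_v2` suffix as assumptions rather than the project's real identifiers: the same normalized stream is appended to both the existing table and the migration target until readers cut over.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.types.Row;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public final class DualWriteWiring {
    // During the migration window the job writes the same stream to both tables;
    // the old sink is removed once readers have cut over to the new table.
    public static void attach(DataStream<Row> rows, TableSchema schema) {
        FlinkSink.forRow(rows, schema)
                .tableLoader(TableLoader.fromHadoopTable("s3a://warehouse/clicks/sku_clicks"))
                .append(); // existing table, still served to current readers
        FlinkSink.forRow(rows, schema)
                .tableLoader(TableLoader.fromHadoopTable("s3a://warehouse/clicks/sku_clicks_v2"))
                .append(); // migration target with the new layout
    }
}
```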