Started June 1, 2025

In Progress

Real-Time Clickstream Analytics

Event-driven pipeline for real-time SKU click ingestion using Kafka, Flink, and Iceberg.

flink iceberg kafka realtime minio

Overview

This project simulates and processes high-volume SKU click events using a real-time pipeline built on Kafka, Flink, and Iceberg. It stores queryable data in Iceberg tables backed by MinIO and integrates with Grafana for aggregation monitoring.

It demonstrates:

  • Sub-minute ingestion and transformation latency
  • Schema evolution safety via Apache Iceberg
  • Durability and reprocessing using Kafka retention
  • Clean separation between dev and deploy environments

Architecture

  • Kafka receives generated click events (up to 50K/sec)
  • Flink (custom JAR job) normalizes and loads the data into Iceberg
  • Iceberg is backed by MinIO using a Hadoop catalog
  • Grafana queries the Iceberg tables and tracks lag metrics
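
The generate-then-normalize flow above can be sketched in plain Python. This is only an illustration under stated assumptions: the project's actual job is a compiled Flink JAR, and the event fields (`sku`, `user_id`, `ts`), the SKU list, and the function names here are hypothetical, since the real schema is not shown on this page.

```python
import random
from datetime import datetime, timezone

# Hypothetical SKU list; stands in for whatever the real generator uses.
SKUS = ["SKU-1001", "SKU-1002", "SKU-1003"]


def generate_click(rng, now_ms):
    """Build one raw click event, roughly as a generator might publish it to Kafka."""
    sku = rng.choice(SKUS)
    # Simulate messy producers: inconsistent casing and stray whitespace.
    if rng.random() < 0.5:
        sku = f" {sku.lower()} "
    return {"sku": sku, "user_id": f"u{rng.randrange(1000)}", "ts": now_ms}


def normalize(raw):
    """Clean one raw event; return None when it cannot be loaded safely."""
    sku = str(raw.get("sku") or "").strip().upper()
    ts = raw.get("ts")
    if not sku or not isinstance(ts, int):
        return None
    event_time = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
    return {
        "sku": sku,
        "user_id": raw.get("user_id"),
        "event_time": event_time.isoformat(),
        # A date column like this is a common partition key; assumed here.
        "event_date": event_time.date().isoformat(),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    raw = generate_click(rng, now_ms=1748736000000)
    print(raw, "->", normalize(raw))
```

Returning `None` for malformed events keeps the rejection decision explicit, so a downstream step can count or dead-letter them instead of failing the whole job.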

Status

This project is in progress. The core ingestion pipeline is stable, but backpressure testing and the auto-scaling components are still being tuned.

Challenges

  • Flink SQL CLI had compatibility issues with table catalogs
  • Iceberg required bootstrap hacks due to default DB assumptions
  • MinIO ACLs and Flink versioning caused silent failures
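
To make the catalog friction more concrete, here is a small sketch that renders the kind of `CREATE CATALOG` statement Flink expects for an Iceberg Hadoop catalog. The property keys `type`, `catalog-type`, and `warehouse` are the documented Iceberg Flink options; the catalog name, bucket path, and the helper function are assumptions for illustration and not the project's actual configuration. Pointing `s3a://` at MinIO additionally needs Hadoop S3A settings such as `fs.s3a.endpoint` and `fs.s3a.path.style.access`, which live in the Hadoop or Flink configuration rather than in this statement.

```python
def render_create_catalog(name, properties):
    """Render a Flink SQL CREATE CATALOG statement from a dict of properties."""

    def quote(value):
        # SQL string literal with single quotes escaped by doubling.
        return "'" + str(value).replace("'", "''") + "'"

    body = ",\n".join(f"  {quote(k)} = {quote(v)}" for k, v in properties.items())
    return f"CREATE CATALOG {name} WITH (\n{body}\n);"


# Hypothetical values: the bucket and path are placeholders.
CATALOG_PROPS = {
    "type": "iceberg",
    "catalog-type": "hadoop",
    "warehouse": "s3a://warehouse/iceberg",
}

if __name__ == "__main__":
    print(render_create_catalog("clicks_catalog", CATALOG_PROPS))
```

Keeping the properties in one dict makes it easier to diff dev and deploy settings, which is where mismatches of this kind tend to hide.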

Lessons Learned

  • Always use small batch replays before enabling live ingestion
  • Avoid the SQL CLI for production workloads; favor compiled jobs
  • Iceberg's flexibility comes with extra catalog config friction
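
The first lesson, replaying a small batch before going live, can be sketched as a bounded dry run that reports how many events a transform rejects. This is a minimal illustration rather than the project's tooling: `replay_batch`, `require_sku`, the default limit, and the rejection threshold are all hypothetical names and values.

```python
from itertools import islice


def require_sku(raw):
    """Toy transform: keep events that carry a non-empty SKU, reject the rest."""
    sku = str(raw.get("sku") or "").strip()
    return {"sku": sku.upper()} if sku else None


def replay_batch(events, transform, limit=1000, max_reject_ratio=0.01):
    """Run at most `limit` events through `transform` and summarize the outcome."""
    accepted = 0
    rejected = 0
    for raw in islice(events, limit):
        try:
            out = transform(raw)
        except Exception:
            # A crash on one event counts as a rejection in the dry run.
            out = None
        if out is None:
            rejected += 1
        else:
            accepted += 1
    total = accepted + rejected
    ratio = rejected / total if total else 0.0
    return {
        "total": total,
        "accepted": accepted,
        "rejected": rejected,
        # An empty replay proves nothing, so it is not treated as a pass.
        "ok": total > 0 and ratio <= max_reject_ratio,
    }


if __name__ == "__main__":
    sample = [{"sku": "sku-1"}] * 98 + [{"sku": ""}] * 2
    print(replay_batch(sample, require_sku, limit=100))
```

Because `islice` stops after `limit` items, the same function works on a list or on a lazy iterator over a retained Kafka topic without reading the whole stream.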

Related Services

  • Kafka to Flink ingestion design
  • Iceberg migration and tuning
  • Schema evolution strategy
  • Real-time monitoring pipelines
  • AWS-compatible S3 object modeling

Recent Activity

Completed the Iceberg table migration with zero downtime using a dual-write pattern.

Python Iceberg Airflow

Optimized Flink job performance, reducing processing latency by 40%.

Flink Kafka