Real-time CryptoMarket Lakehouse
An end-to-end streaming data platform for crypto market microstructure analysis. ~5M ticks/day · 50 pairs · sub-30s freshness. Built to production engineering standards.
// architecture
End-to-end pipeline
Every message is tracked from exchange wire to API response. Streaming ingestion layer on top; storage and query layer below.
Binance WS
L2 order book snapshots + diffs streamed over WebSocket for 50 trading pairs. On startup: REST snapshot → buffer diffs → apply in sequence. Gap detected → reconnect with exponential backoff.
Redpanda
Kafka-compatible broker with no ZooKeeper. Topics: market.raw.orderbook, market.normalized.*, market.agg.ohlcv_1m, market.dlq. Idempotent producer with deterministic message keys.
PyFlink
Two stateful streaming jobs: (1) normalize + deduplicate → book ticker + spread + imbalance; (2) tumbling 1-min window → OHLCV. Event-time watermarks, late-event DLQ routing.
Iceberg / MinIO
Three-layer lakehouse: raw.* (append-only, full Kafka metadata), normalized.* (typed, deduped), marts.* (aggregated, API-facing). Full Kafka offset preserved for replay.
Trino
Trino 457 queries Iceberg via REST catalog. MPP engine decouples storage from compute. In-process read model pre-loads 60 bars/symbol; hot-path requests skip Trino entirely.
dbt
Transformation-as-code. Staging → intermediate → marts lineage. Tests baked into the model graph. dbt models drive the marts layer consumed by FastAPI.
FastAPI
Async REST API with Pydantic v2 response models. Background poller refreshes in-memory read model every 30–60s. p95 latency 10ms at 1,000 req/s (local benchmark).
Postgres + Debezium
WAL-based change data capture. Postgres 16 with logical replication. Symbol config changes propagate to Redpanda without polling. Zero impact on OLTP write path.
// storage layers
Medallion architecture
Three Iceberg layers on MinIO. Data is never deleted from bronze — each layer adds structure and removes noise, preserving full replay capability at every stage.
raw.*tables
raw.orderbook_diffs- › Full Kafka metadata preserved (topic · partition · offset)
- › Raw Binance L2 payload — never mutated
- › Replay-safe: any offset range re-processable
- › Written by PyFlink normalize job on arrival
normalized.*tables
normalized.book_tickernormalized.ohlcv_1m- › Typed + schema-enforced, dedup by last_update_id
- › Event-time indexed, 30s watermark
- › Late events routed to market.dlq
- › Spread, mid-price, top-5 imbalance computed by Flink
↳ dbt source for staging models
marts.*tables
marts.mart_ohlcvmarts.mart_liquiditymarts.mart_exchange_health- › Produced by dbt: staging → intermediate → marts
- › mart_ohlcv: VWAP, OHLC per 1m window
- › mart_liquidity: latest spread_bps, imbalance signal
- › mart_exchange_health: per-symbol freshness + SLA status
↳ Queried directly by FastAPI read model
// performance
By the numbers
Local benchmark. Single-machine Docker stack — not a production deployment. Numbers reflect the engineering quality of the pipeline design.
› throughput.daily_events
Estimated throughput at full capacity across 50 pairs. The pipeline runs locally on demand — not a continuously operated system.
› ingest.symbol_count
Concurrent trading pairs tracked via a single multiplexed WebSocket
› pipeline.end_to_end_latency
Exchange wire → Iceberg silver layer, under normal operating conditions
› api.peak_qps_tested
Peak load tested with k6; 0% error rate sustained throughout
› api.latency_p95
API response time at 95th percentile. Local benchmark — not production.
› tests.line_coverage
100+ unit and integration tests across ingest, flink, and API layers
* ~5M ticks/day is an estimate at full capacity — pipeline runs locally on demand, not continuously. p95 measured with k6, single-machine Docker stack, 2026-05-18
// tech stack
Every layer, intentional
No technology was chosen by default. Each replaces a mainstream alternative with a deliberate tradeoff.
Python 3.13 + aiokafka
Binance WebSocket → Kafka producer
› Async I/O handles 50 concurrent streams; idempotent producer with DLQ routing
Redpanda v24.3
Kafka-compatible event broker
› No ZooKeeper dependency; drop-in Kafka replacement with better tail latency
PyFlink 1.18
Stateful stream processing
› Exactly-once semantics, event-time watermarks, native Iceberg sink connector
Apache Iceberg 1.6
Open table format on MinIO / GCS
› ACID transactions, time-travel queries, schema evolution — without a proprietary warehouse
Debezium + Postgres 16
WAL-based change data capture
› Zero-impact CDC via logical replication; symbol config changes propagate in real time
Trino 457
Federated SQL over Iceberg
› MPP query engine; decouples storage from compute; familiar SQL interface for analysts
dbt
Transformation-as-code
› Staging → intermediate → marts lineage; tests baked into the model graph
FastAPI + Pydantic v2
Async REST API, in-memory read model
› Hot path serves from pre-loaded in-memory model; cold path falls back to Trino
Iceberg REST (tabulario)
Open catalog for Flink + Trino
› Standard REST catalog spec; both Flink and Trino register tables without coordination
Airflow
DAG-based pipeline orchestration
› Idempotent backfill DAGs; sensor-based SLA alerting; retry policies per task
Apache Spark
Distributed compaction & backfill
› Small-file compaction for Iceberg; parallel historical replay across partitions
Prometheus + Grafana
Metrics, latency, pipeline health
› Custom Prometheus middleware on FastAPI; Grafana panels track p50/p95/p99 per endpoint