open source · 6 phases complete · phase 6 in progress

Real-time CryptoMarket Lakehouse

$

An end-to-end streaming data platform for crypto market microstructure analysis. ~5M ticks/day · 50 pairs · sub-30s freshness. Built to production engineering standards.

SCROLL

// architecture

End-to-end pipeline

Every message is tracked from exchange wire to API response. Streaming ingestion layer on top; storage and query layer below.

SOURCEBROKERPROCESSINGSTORAGEQUERYANALYTICSAPIWAL / CDCBinance WSL2 Order BookRedpandaKafka-compatPyFlinkStream proc.Icebergon MinIOLakehouseDebeziumPostgres CDCTrinoQuery enginedbtTransformationsFastAPIREST API
Source

Binance WS

L2 order book snapshots + diffs streamed over WebSocket for 50 trading pairs. On startup: REST snapshot → buffer diffs → apply in sequence. Gap detected → reconnect with exponential backoff.

Broker

Redpanda

Kafka-compatible broker with no ZooKeeper. Topics: market.raw.orderbook, market.normalized.*, market.agg.ohlcv_1m, market.dlq. Idempotent producer with deterministic message keys.

Processing

PyFlink

Two stateful streaming jobs: (1) normalize + deduplicate → book ticker + spread + imbalance; (2) tumbling 1-min window → OHLCV. Event-time watermarks, late-event DLQ routing.

Storage

Iceberg / MinIO

Three-layer lakehouse: raw.* (append-only, full Kafka metadata), normalized.* (typed, deduped), marts.* (aggregated, API-facing). Full Kafka offset preserved for replay.

Query

Trino

Trino 457 queries Iceberg via REST catalog. MPP engine decouples storage from compute. In-process read model pre-loads 60 bars/symbol; hot-path requests skip Trino entirely.

Analytics

dbt

Transformation-as-code. Staging → intermediate → marts lineage. Tests baked into the model graph. dbt models drive the marts layer consumed by FastAPI.

API

FastAPI

Async REST API with Pydantic v2 response models. Background poller refreshes in-memory read model every 30–60s. p95 latency 10ms at 1,000 req/s (local benchmark).

CDC

Postgres + Debezium

WAL-based change data capture. Postgres 16 with logical replication. Symbol config changes propagate to Redpanda without polling. Zero impact on OLTP write path.

// storage layers

Medallion architecture

Three Iceberg layers on MinIO. Data is never deleted from bronze — each layer adds structure and removes noise, preserving full replay capability at every stage.

BronzeSilverGoldraw → curated → business-facing
Bronze
raw.*
append-only

tables

raw.orderbook_diffs
  • Full Kafka metadata preserved (topic · partition · offset)
  • Raw Binance L2 payload — never mutated
  • Replay-safe: any offset range re-processable
  • Written by PyFlink normalize job on arrival
Silver
normalized.*
deduplicated

tables

normalized.book_ticker
normalized.ohlcv_1m
  • Typed + schema-enforced, dedup by last_update_id
  • Event-time indexed, 30s watermark
  • Late events routed to market.dlq
  • Spread, mid-price, top-5 imbalance computed by Flink

↳ dbt source for staging models

Gold
marts.*
API-facing

tables

marts.mart_ohlcv
marts.mart_liquidity
marts.mart_exchange_health
  • Produced by dbt: staging → intermediate → marts
  • mart_ohlcv: VWAP, OHLC per 1m window
  • mart_liquidity: latest spread_bps, imbalance signal
  • mart_exchange_health: per-symbol freshness + SLA status

↳ Queried directly by FastAPI read model

// performance

By the numbers

Local benchmark. Single-machine Docker stack — not a production deployment. Numbers reflect the engineering quality of the pipeline design.

throughput.daily_events

~5Mticks / day *

Estimated throughput at full capacity across 50 pairs. The pipeline runs locally on demand — not a continuously operated system.

ingest.symbol_count

50pairs

Concurrent trading pairs tracked via a single multiplexed WebSocket

pipeline.end_to_end_latency

<30sE2E

Exchange wire → Iceberg silver layer, under normal operating conditions

api.peak_qps_tested

1,000req / s

Peak load tested with k6; 0% error rate sustained throughout

api.latency_p95

10msp95

API response time at 95th percentile. Local benchmark — not production.

tests.line_coverage

89%coverage

100+ unit and integration tests across ingest, flink, and API layers

* ~5M ticks/day is an estimate at full capacity — pipeline runs locally on demand, not continuously. p95 measured with k6, single-machine Docker stack, 2026-05-18

// tech stack

Every layer, intentional

No technology was chosen by default. Each replaces a mainstream alternative with a deliberate tradeoff.

ingestion

Python 3.13 + aiokafka

Binance WebSocket → Kafka producer

Async I/O handles 50 concurrent streams; idempotent producer with DLQ routing

broker

Redpanda v24.3

Kafka-compatible event broker

No ZooKeeper dependency; drop-in Kafka replacement with better tail latency

processing

PyFlink 1.18

Stateful stream processing

Exactly-once semantics, event-time watermarks, native Iceberg sink connector

storage

Apache Iceberg 1.6

Open table format on MinIO / GCS

ACID transactions, time-travel queries, schema evolution — without a proprietary warehouse

cdc

Debezium + Postgres 16

WAL-based change data capture

Zero-impact CDC via logical replication; symbol config changes propagate in real time

query

Trino 457

Federated SQL over Iceberg

MPP query engine; decouples storage from compute; familiar SQL interface for analysts

analytics

dbt

Transformation-as-code

Staging → intermediate → marts lineage; tests baked into the model graph

api

FastAPI + Pydantic v2

Async REST API, in-memory read model

Hot path serves from pre-loaded in-memory model; cold path falls back to Trino

catalog

Iceberg REST (tabulario)

Open catalog for Flink + Trino

Standard REST catalog spec; both Flink and Trino register tables without coordination

orchestration

Airflow

DAG-based pipeline orchestration

Idempotent backfill DAGs; sensor-based SLA alerting; retry policies per task

backfill

Apache Spark

Distributed compaction & backfill

Small-file compaction for Iceberg; parallel historical replay across partitions

observability

Prometheus + Grafana

Metrics, latency, pipeline health

Custom Prometheus middleware on FastAPI; Grafana panels track p50/p95/p99 per endpoint