open source · 6 phases complete · phase 6 in progress

Real-time CryptoMarket Lakehouse

An end-to-end streaming data platform for crypto market microstructure analysis. ~5M ticks/day · 50 pairs · sub-30s freshness. Built to production engineering standards.

view source explore pipeline

SCROLL

// architecture

End-to-end pipeline

Every message is tracked from exchange wire to API response. Streaming ingestion layer on top; storage and query layer below.

Source

Binance WS

L2 order book snapshots + diffs streamed over WebSocket for 50 trading pairs. On startup: REST snapshot → buffer diffs → apply in sequence. Gap detected → reconnect with exponential backoff.

Broker

Redpanda

Kafka-compatible broker with no ZooKeeper. Topics: market.raw.orderbook, market.normalized.*, market.agg.ohlcv_1m, market.dlq. Idempotent producer with deterministic message keys.

Processing

PyFlink

Two stateful streaming jobs: (1) normalize + deduplicate → book ticker + spread + imbalance; (2) tumbling 1-min window → OHLCV. Event-time watermarks, late-event DLQ routing.

Storage

Iceberg / MinIO

Three-layer lakehouse: raw.* (append-only, full Kafka metadata), normalized.* (typed, deduped), marts.* (aggregated, API-facing). Full Kafka offset preserved for replay.

Query

Trino

Trino 457 queries Iceberg via REST catalog. MPP engine decouples storage from compute. In-process read model pre-loads 60 bars/symbol; hot-path requests skip Trino entirely.

Analytics

dbt

Transformation-as-code. Staging → intermediate → marts lineage. Tests baked into the model graph. dbt models drive the marts layer consumed by FastAPI.

API

FastAPI

Async REST API with Pydantic v2 response models. Background poller refreshes in-memory read model every 30–60s. p95 latency 10ms at 1,000 req/s (local benchmark).

CDC

Postgres + Debezium

WAL-based change data capture. Postgres 16 with logical replication. Symbol config changes propagate to Redpanda without polling. Zero impact on OLTP write path.

// storage layers

Medallion architecture

Three Iceberg layers on MinIO. Data is never deleted from bronze — each layer adds structure and removes noise, preserving full replay capability at every stage.

BronzeSilverGoldraw → curated → business-facing

Bronze

raw.*

append-only

tables

raw.orderbook_diffs

› Full Kafka metadata preserved (topic · partition · offset)
› Raw Binance L2 payload — never mutated
› Replay-safe: any offset range re-processable
› Written by PyFlink normalize job on arrival

Silver

normalized.*

deduplicated

tables

normalized.book_ticker

normalized.ohlcv_1m

› Typed + schema-enforced, dedup by last_update_id
› Event-time indexed, 30s watermark
› Late events routed to market.dlq
› Spread, mid-price, top-5 imbalance computed by Flink

↳ dbt source for staging models

Gold

marts.*

API-facing

tables

marts.mart_ohlcv

marts.mart_liquidity

marts.mart_exchange_health

› Produced by dbt: staging → intermediate → marts
› mart_ohlcv: VWAP, OHLC per 1m window
› mart_liquidity: latest spread_bps, imbalance signal
› mart_exchange_health: per-symbol freshness + SLA status

↳ Queried directly by FastAPI read model

// performance

By the numbers

Local benchmark. Single-machine Docker stack — not a production deployment. Numbers reflect the engineering quality of the pipeline design.

› throughput.daily_events

~5Mticks / day *

Estimated throughput at full capacity across 50 pairs. The pipeline runs locally on demand — not a continuously operated system.

› ingest.symbol_count

50pairs

Concurrent trading pairs tracked via a single multiplexed WebSocket

› pipeline.end_to_end_latency

<30sE2E

Exchange wire → Iceberg silver layer, under normal operating conditions

› api.peak_qps_tested

1,000req / s

Peak load tested with k6; 0% error rate sustained throughout

› api.latency_p95

10msp95

API response time at 95th percentile. Local benchmark — not production.

› tests.line_coverage

89%coverage

100+ unit and integration tests across ingest, flink, and API layers

* ~5M ticks/day is an estimate at full capacity — pipeline runs locally on demand, not continuously. p95 measured with k6, single-machine Docker stack, 2026-05-18

// tech stack

Every layer, intentional

No technology was chosen by default. Each replaces a mainstream alternative with a deliberate tradeoff.

ingestion

Python 3.13 + aiokafka

Binance WebSocket → Kafka producer

› Async I/O handles 50 concurrent streams; idempotent producer with DLQ routing

broker

Redpanda v24.3

Kafka-compatible event broker

› No ZooKeeper dependency; drop-in Kafka replacement with better tail latency

processing

PyFlink 1.18

Stateful stream processing

› Exactly-once semantics, event-time watermarks, native Iceberg sink connector

storage

Apache Iceberg 1.6

Open table format on MinIO / GCS

› ACID transactions, time-travel queries, schema evolution — without a proprietary warehouse

cdc

Debezium + Postgres 16

WAL-based change data capture

› Zero-impact CDC via logical replication; symbol config changes propagate in real time

query

Trino 457

Federated SQL over Iceberg

› MPP query engine; decouples storage from compute; familiar SQL interface for analysts

analytics

dbt

Transformation-as-code

› Staging → intermediate → marts lineage; tests baked into the model graph

api

FastAPI + Pydantic v2

Async REST API, in-memory read model

› Hot path serves from pre-loaded in-memory model; cold path falls back to Trino

catalog

Iceberg REST (tabulario)

Open catalog for Flink + Trino

› Standard REST catalog spec; both Flink and Trino register tables without coordination

orchestration

Airflow

DAG-based pipeline orchestration

› Idempotent backfill DAGs; sensor-based SLA alerting; retry policies per task

backfill

Apache Spark

Distributed compaction & backfill

› Small-file compaction for Iceberg; parallel historical replay across partitions

observability

Prometheus + Grafana

Metrics, latency, pipeline health

› Custom Prometheus middleware on FastAPI; Grafana panels track p50/p95/p99 per endpoint