100 challenges from CQRS and CDC through Kafka, distributed transactions, time-series, object storage, streaming video, multi-region deployments, Lambda/Kappa architectures, and database internals deep cuts.
Data-Intensive Systems takes you from the first principles of event-driven architecture — CQRS, change-data capture, and the Kafka log — through the hard problems that appear at scale: distributed transactions across services, time-series data at write speeds no RDBMS can sustain, multi-region deployments with replication lag you can actually measure, and real-time analytics pipelines that must be correct under failure. The final two modules go below the query interface into PostgreSQL internals: buffer pools, WAL replay, MVCC row versions, autovacuum, B-tree splits, statistics histograms, and write amplification — the layer where production databases break in ways that `EXPLAIN` alone can't diagnose. Every challenge is a runnable program, every project has a testable correctness criterion, and the capstone requires you to design and build a production-grade data platform from scratch.
Built by Lakshya Kumar
We grant free access case-by-case: students, career-switchers, builders on a tight budget.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Design and implement a production-grade data platform that ingests events from at least two sources via Kafka, stores them in a time-series store and a relational store using a CQRS pattern, exposes a read API backed by materialized views, and includes a Kappa-style analytics pipeline that computes rolling 1-minute and session-level aggregations. The platform must handle at least 10,000 events/second, survive a simulated node failure without data loss (demonstrated via WAL replay or Kafka offset replay), and produce an ops runbook covering vacuum, index maintenance, and upgrade decisions.
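The rolling 1-minute aggregation in the capstone brief can be prototyped without a broker before wiring it to Kafka. The sketch below is one possible shape, not the course's reference solution: it counts events per key in 1-minute tumbling windows on event time and emits a window once a watermark passes its end. The class name `TumblingCounter`, the 5-second allowed lateness, and the "max event time minus lateness" watermark rule are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows


def window_start(ts_ms: int) -> int:
    """Align an event timestamp to the start of its 1-minute window."""
    return ts_ms - (ts_ms % WINDOW_MS)


class TumblingCounter:
    """Counts events per (key, window); emits a window once the watermark passes its end."""

    def __init__(self, allowed_lateness_ms: int = 5_000):  # assumed lateness budget
        self.allowed_lateness_ms = allowed_lateness_ms
        self.open_windows = defaultdict(int)  # (key, window_start) -> count
        self.max_event_ts = 0
        self.dropped_late = 0

    @property
    def watermark(self) -> int:
        # Assumed rule: watermark trails the largest event time seen so far.
        return self.max_event_ts - self.allowed_lateness_ms

    def add(self, key: str, ts_ms: int) -> list:
        """Ingest one event; return the list of (key, window_start, count) windows that closed."""
        start = window_start(ts_ms)
        if start + WINDOW_MS <= self.watermark:
            # The window was already emitted, so this event is too late to count.
            self.dropped_late += 1
            return []
        self.open_windows[(key, start)] += 1
        self.max_event_ts = max(self.max_event_ts, ts_ms)
        return self._flush()

    def _flush(self) -> list:
        closed = sorted(kw for kw in self.open_windows if kw[1] + WINDOW_MS <= self.watermark)
        return [(key, start, self.open_windows.pop((key, start))) for key, start in closed]
```

Events that arrive out of order but inside the lateness budget still land in their window; events behind the watermark are counted in `dropped_late`, which is one place a streaming total and a batch total over the same data can diverge.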
This is a Data-Intensive Systems course for experienced builders who already write production code. Every task involves real, runnable code — no pseudocode, no toy examples. When helping with streaming tasks, use Kafka client libraries (kafka-go, confluent-kafka, rdkafka, kafkajs) and real Kafka semantics (offsets, consumer groups, watermarks). When helping with database internals tasks, use actual PostgreSQL system views (pg_stat_user_tables, pg_statistic, pg_buffercache, pgstattuple) and real SQL — never simulated or mocked DB behavior. For distributed systems tasks, be precise about CAP trade-offs, consistency levels, and failure modes; avoid hand-waving. If a builder asks why their metric disagrees between streaming and batch, walk through the watermark and window semantics before suggesting a code fix.
Build a Kafka-driven pipeline that guarantees exactly-once semantics from producer to sink. Include an idempotent producer config, a transactional consume-transform-produce chain, and a chaos test that kills consumers mid-batch and verifies that the sink store contains no duplicates and no losses.
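As a starting point for this project, the sketch below shows the kind of client settings involved and a check the chaos test could run afterwards. The config keys are standard librdkafka properties as accepted by confluent-kafka; the broker address, `transactional.id`, and `group.id` values are placeholders, and `verify_sink` is a hypothetical helper that assumes every event carries a unique ID that also reaches the sink.

```python
from collections import Counter

# librdkafka-style settings for a transactional producer (values are placeholders).
PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",    # assumption: local broker
    "enable.idempotence": True,               # broker de-duplicates producer retries
    "acks": "all",                            # required when idempotence is enabled
    "transactional.id": "pipeline-worker-0",  # hypothetical; must be stable per worker
}

# Settings for the consumer side of the consume-transform-produce loop.
CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "pipeline-sink",            # hypothetical group name
    "isolation.level": "read_committed",    # skip records from aborted transactions
    "enable.auto.commit": False,            # offsets are committed inside the transaction
    "auto.offset.reset": "earliest",
}


def verify_sink(produced_ids, sink_ids):
    """Return (missing, duplicated) event IDs after a chaos run.

    produced_ids: every event ID the test harness sent.
    sink_ids: every event ID read back from the sink store, with repeats.
    """
    sink_counts = Counter(sink_ids)
    missing = sorted(set(produced_ids) - set(sink_counts))
    duplicated = sorted(i for i, n in sink_counts.items() if n > 1)
    return missing, duplicated
```

A passing chaos run would return two empty lists, for example `verify_sink(sent, read_back) == ([], [])`. Note that Kafka transactions cover Kafka-to-Kafka writes; an external sink still needs idempotent writes or offsets stored atomically with the data.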
Build a CDC pipeline that replicates Postgres changes (using logical replication or Debezium) to a columnar warehouse (Snowflake, BigQuery, or DuckDB). Handle schema evolution, large transaction batches, and a backfill from existing data. Validate row-count parity over a 24-hour run.
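The row-count parity check can be as simple as comparing two count maps taken at a point where the pipeline has drained. The helper below is a minimal sketch under that assumption; `parity_report` and its dictionary inputs are hypothetical, and collecting the counts (for example with `SELECT count(*)` against each side) is left to the caller.

```python
def parity_report(source_counts: dict, target_counts: dict) -> dict:
    """Compare per-table row counts.

    Returns {table: (source_count, target_count)} for every table whose counts
    differ; a table missing on one side shows up with None on that side.
    """
    tables = set(source_counts) | set(target_counts)
    return {
        table: (source_counts.get(table), target_counts.get(table))
        for table in sorted(tables)
        if source_counts.get(table) != target_counts.get(table)
    }
```

An empty report means the counts agree. Equal counts do not prove equal contents, so a stricter run might add per-table checksums or bucket the counts by hour to localize drift.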
Set up an Apache Iceberg-based lakehouse on S3 (or MinIO): ingest 100M rows, run schema evolution (add/drop columns), demonstrate time travel via snapshot queries, and benchmark query performance against a Parquet-only baseline. Document the metadata layer.
Build a feature store for an ML use case: batch features in an offline store (S3 + Parquet), real-time features in an online store (Redis), a pipeline that backfills offline -> online, and a serving layer that returns features with P95 latency under 5 ms. Include feature versioning and rollback.
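One way to approach versioning and rollback is to write each feature version under its own key prefix and keep a pointer to the active version, so a rollback only moves the pointer. The sketch below illustrates that idea with a plain dictionary standing in for Redis; the class `VersionedFeatureStore`, the key layout, and the nearest-rank `p95` helper are all assumptions for illustration, not a prescribed design.

```python
class VersionedFeatureStore:
    """In-memory stand-in for an online store with versioned feature keys."""

    def __init__(self):
        self._kv = {}      # stand-in for Redis: key -> feature value
        self._active = {}  # feature name -> stack of activated versions

    @staticmethod
    def _key(feature: str, version: int, entity_id: str) -> str:
        return f"feat:{feature}:v{version}:{entity_id}"  # hypothetical key layout

    def backfill(self, feature: str, version: int, rows: dict) -> None:
        """Load offline rows ({entity_id: value}) under a specific version."""
        for entity_id, value in rows.items():
            self._kv[self._key(feature, version, entity_id)] = value

    def activate(self, feature: str, version: int) -> None:
        """Point reads at `version`, remembering the previous one for rollback."""
        self._active.setdefault(feature, []).append(version)

    def rollback(self, feature: str) -> int:
        """Revert to the previously active version and return it."""
        history = self._active[feature]
        if len(history) < 2:
            raise ValueError("no earlier version to roll back to")
        history.pop()
        return history[-1]

    def get(self, feature: str, entity_id: str):
        version = self._active[feature][-1]
        return self._kv.get(self._key(feature, version, entity_id))


def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Because old versions stay in the store until they are explicitly deleted, rollback does not need a second backfill. The `p95` helper can be fed per-request timings from a load test to check the serving-layer latency target.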
Goes deeper than Kleppmann on storage engine internals — essential companion for Module 10.