A practical, milestone-driven playbook: what breaks at each step (100 → 1k → 10k → 100k users) and what to add, framed as moves rather than primitives.
A timeline-first course on running a real service from launch to 100,000 users. Each module is a phase: Day 0 (single server) → first 100 users (observability) → first 1k users (the database cracks) → 10k users (caching, replicas, CDN) → queues → horizontal scale → DB at scale → capacity, cost & on-call → deploys & blast-radius → 100k users in production (incidents & post-mortems). Every challenge is a concrete move, with code in Go / Python / Node / Rust where it matters, and references to real war stories. Distinct from `scalable-systems` (primitives) and `api-performance` (latency) — this one is about *what to do next*, in order.
Built by Lakshya Kumar
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Pick a toy CRUD app (build a tiny one if you have to). Write a 'launch to 100k DAU' plan: starting stack, projected milestones at 100/1k/10k/100k users, capacity math at each stage, the three things you'd add first, the three things you'd do *wrong* on the first try and how you'd recognize the mistake, and the SLOs you'd defend. The plan is the artifact; the app is the prop.
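The "capacity math at each stage" part of the plan can start as a back-of-the-envelope calculation. The sketch below is one way to do it, not the course's prescribed method: the requests-per-user figure, the peak-to-average ratio, the per-instance throughput, and the target utilization are all hypothetical defaults that you would replace with measured numbers.

```python
import math

SECONDS_PER_DAY = 86_400


def peak_rps(dau: int,
             requests_per_user_per_day: float = 50.0,  # assumption: measure this
             peak_to_avg: float = 5.0) -> float:       # assumption: traffic is bursty
    """Estimate peak requests per second from daily active users."""
    avg_rps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return avg_rps * peak_to_avg


def instances_needed(peak: float,
                     per_instance_rps: float = 100.0,   # assumption: from a load test
                     target_utilization: float = 0.5) -> int:
    """Instances required to serve `peak` while staying under target utilization."""
    usable_rps = per_instance_rps * target_utilization
    return max(1, math.ceil(peak / usable_rps))


if __name__ == "__main__":
    for dau in (100, 1_000, 10_000, 100_000):
        peak = peak_rps(dau)
        print(f"{dau:>7} DAU: ~{peak:.1f} peak req/s, "
              f"{instances_needed(peak)} instance(s)")
```

The value of writing it down is that each input becomes a claim you can check against real traffic at the next milestone.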
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
I'm taking a "Scaling to 100k Users" course that's organized by user-count milestone (Day 0 → 100 → 1k → 10k → 100k) rather than by primitive. It covers: single-server stacks, observability, DB performance, caching and CDN, queues, horizontal scale and statelessness, the database at scale, capacity/cost/SLOs, deploys and blast-radius control, and incidents/post-mortems.

Here's my context:
1. My current product/project is: [describe]
2. Current user/traffic state: [e.g. pre-launch / 50 DAU / 5k DAU / etc.]
3. My current stack: [language, DB, hosting]
4. Where I think the next bottleneck is: [my guess]

Given that, answer these specifically:
- Which module of this course should I prioritize, and why?
- Name 3 concrete things in my current state that would benefit most from this course.
- Name 1 thing the course will NOT help me with so I don't have wrong expectations.
- If I only have 2 hours this week, which single skill from the syllabus gives me the biggest payoff, and how would I measure that it worked?
Pick a real (or fictional) service. Write a complete on-call package: SLO doc, top-5 alert rules each with annotations and runbook, an incident response template (declare/IC/comms/post-mortem), a kill-switch registry with at least 4 entries, and a quarterly chaos-drill schedule. Submit the pack — this is the artifact a new SRE would inherit.
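An SLO doc and its alert rules usually rest on two small calculations: how much error budget the SLO leaves, and how fast it is being consumed. The following is a minimal sketch of both, assuming an availability-style SLO over a rolling window. The function names are illustrative, and the 14.4 paging threshold in the comment is one commonly cited choice for a 30-day window, not a requirement of this capstone.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget exactly at the end of the window.
    """
    return observed_error_ratio / (1.0 - slo)


def should_page(observed_error_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    # Assumption: 14.4 corresponds to spending 2% of a 30-day budget in one hour.
    return burn_rate(observed_error_ratio, slo) >= threshold


if __name__ == "__main__":
    slo = 0.999
    print(f"Budget: {error_budget_minutes(slo):.1f} min per 30 days")
    print(f"Burn rate at 2% errors: {burn_rate(0.02, slo):.1f}")
    print(f"Page? {should_page(0.02, slo)}")
```

Alerting on burn rate rather than raw error counts ties each page to the SLO the doc commits to defending.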
Take a real (or scaffolded) app under realistic load. Find and fix three perf bottlenecks: one DB-level (index, N+1, query rewrite), one app-level (caching, projection, async), one infra-level (CDN, gzip, HTTP/2, keepalive). Submit before/after p99 latency, throughput, and cost-per-request — with the EXPLAIN plans and profiling output to back each up.
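For the before/after numbers, it helps to fix how p99 and cost-per-request are computed so both runs are measured the same way. This sketch assumes you have raw latency samples from a load test and uses the nearest-rank percentile definition. Other tools interpolate and may report slightly different values, so state which definition you used.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def cost_per_request(monthly_cost: float, monthly_requests: int) -> float:
    """Infrastructure spend divided by requests served in the same period."""
    return monthly_cost / monthly_requests


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, for illustration only.
    before = [120, 95, 110, 480, 105, 130, 900, 115, 100, 125]
    after = [60, 55, 58, 140, 57, 62, 210, 59, 56, 61]
    print("p99 before:", percentile(before, 99), "ms")
    print("p99 after: ", percentile(after, 99), "ms")
```

With only a handful of samples, p99 is effectively the maximum, so a credible submission needs thousands of requests per run.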
Take a single-box (or sticky) app to 3+ instances behind a load balancer. Move every piece of state — sessions, uploads, locks, in-memory caches — out to shared infrastructure. Verify by killing instances mid-traffic with zero user-visible errors. Submit the diff, the LB config, the /readyz semantics, and chaos-test evidence.
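One way to pin down "/readyz semantics" is to write the decision as a pure function and expose it through whatever web framework the app uses. The sketch below is an assumption about reasonable semantics, not a specification from the course: the instance reports ready only if it is not draining and every shared dependency it needs (session store, database, object storage) answers a check. The check names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def readyz(checks: Dict[str, Callable[[], bool]],
           draining: bool) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, detail) for a readiness probe.

    200 means the load balancer may route traffic to this instance.
    503 means it should be taken out of rotation, not restarted.
    """
    if draining:
        return 503, {"status": "draining"}

    detail: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        detail[name] = "ok" if ok else "fail"

    status = 200 if all(v == "ok" for v in detail.values()) else 503
    return status, detail


if __name__ == "__main__":
    # Hypothetical dependency checks; real ones would ping Redis, the DB, etc.
    print(readyz({"sessions": lambda: True, "db": lambda: True}, draining=False))
    print(readyz({"sessions": lambda: True, "db": lambda: False}, draining=False))
```

Keeping readiness separate from liveness matters here: a failing dependency should remove the instance from the load balancer's pool without triggering a restart loop.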
Take a deploy process that currently risks user-visible 5xx errors. Add: rolling deploys with /readyz semantics and a drain handler, feature flags (with kill switches), a smoke-test gate in CI, a quarterly rollback drill, and a documented blue-green or canary plan for a specific class of risky changes. Submit the before/after deploy timeline, evidence of zero 5xx responses across the next 10 deploys, and the rollback drill record.
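The drain handler is the piece that most often decides whether a rolling deploy produces 5xx responses. A minimal sketch of the idea follows, assuming a threaded server: once draining starts, new requests are refused, and shutdown waits for in-flight requests to finish or for a timeout to expire. The `Drainer` class and its method names are hypothetical, not from any particular framework.

```python
import threading


class Drainer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # no requests in flight at startup

    @property
    def draining(self) -> bool:
        return self._draining

    def start_request(self) -> bool:
        """Call at request entry. False means: refuse, this instance is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def finish_request(self) -> None:
        """Call at request exit, including on errors (use try/finally)."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout: float) -> bool:
        """Stop accepting work and wait. True if all requests finished in time."""
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout)
```

In a real service, `drain` would typically be called from a SIGTERM handler registered with `signal.signal`, with the readiness endpoint reporting 503 while `draining` is true, so the load balancer stops sending traffic before the process exits.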
Modules 8 and 10 pull directly from this. The chapters on monitoring, alerting on SLOs, and postmortem culture are required reading.
Stability patterns (circuit breakers, bulkheads, timeouts), capacity, deployment — the practitioner's companion to this course.