Scaling to 100k Users: The Playbook

A practical, milestone-driven playbook: what breaks at each step (100 → 1k → 10k → 100k users) and what to add — not in primitives, in moves.

Certificate: 1 of 5 capstones

A timeline-first course on running a real service from launch to 100,000 users. Each module is a phase: Day 0 (single server) → first 100 users (observability) → first 1k users (the database cracks) → 10k users (caching, replicas, CDN) → queues → horizontal scale → DB at scale → capacity, cost & on-call → deploys & blast-radius → 100k users in production (incidents & post-mortems). Every challenge is a concrete move, with code in Go / Python / Node / Rust where it matters, and references to real war stories. Distinct from `scalable-systems` (primitives) and `api-performance` (latency) — this one is about *what to do next*, in order.

Built by Lakshya Kumar

Tags: engineering, scaling, production, operations, infrastructure, playbook

Before you start (4 items)

You've shipped at least one web service to production (even a hobby one), and know what TLS, JSON, and SQL are.
You can read code in at least one of Go, Python, Node.js, or Rust. The course's code tabs cover multiple; pick one.
You're not afraid of the command line — Docker, ssh, kubectl-ish workflows.
You have Docker installed locally and can spin up a Postgres or Redis instance in a container.

Is this course for you? Ask an AI

Get access to Scaling to 100k Users: The Playbook

$3.99

30-day access

Ask for access

We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.

Capstone projects

Submit any 1 of 5 to earn the certificate

Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.

100k DAU growth plan with capacity math

Pick a toy CRUD app (build a tiny one if you have to). Write a 'launch to 100k DAU' plan: starting stack, projected milestones at 100/1k/10k/100k users, capacity math at each stage, the three things you'd add first, the three things you'd do *wrong* on the first try and how you'd recognize the mistake, and the SLOs you'd defend. The plan is the artifact; the app is the prop.

Minimum rating for approval: 3/5
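The capacity math for each milestone can be sketched in a few lines. This is a minimal illustration, not the course's method: the 50 requests per user per day, the 3x peak-to-average ratio, the 200 rps per instance, and the 50% target utilization are all placeholder assumptions that you would replace with measured values from your own app.

```python
import math

SECONDS_PER_DAY = 86_400


def peak_rps(dau: int, requests_per_user_per_day: float, peak_to_avg: float) -> float:
    """Estimate peak requests per second from daily active users."""
    avg_rps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return avg_rps * peak_to_avg


def instances_needed(peak: float, rps_per_instance: float,
                     target_utilization: float, min_instances: int = 2) -> int:
    """Instances required to serve `peak` while staying under the target utilization.

    min_instances defaults to 2 (an assumption) so one instance can fail or be
    redeployed without taking the service down.
    """
    usable_rps = rps_per_instance * target_utilization
    return max(min_instances, math.ceil(peak / usable_rps))


# All inputs below are illustrative assumptions, not measurements.
for dau in (100, 1_000, 10_000, 100_000):
    peak = peak_rps(dau, requests_per_user_per_day=50, peak_to_avg=3.0)
    count = instances_needed(peak, rps_per_instance=200, target_utilization=0.5)
    print(f"{dau:>7} DAU: peak ~{peak:6.1f} rps -> {count} instance(s)")
```

With inputs like these, the request rate is rarely what forces the next move; the plan should say which resource (database connections, disk, a slow query) you expect to run out first and why.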

Further reading & study material (7 sources)

Paste this into any AI chat. Fill in the bracketed parts with your context, and you'll get back a straight answer on whether this belongs on your plate.

Prompt

I'm taking a "Scaling to 100k Users" course that's organized by user-count milestone (Day 0 → 100 → 1k → 10k → 100k) rather than by primitive. It covers: single-server stacks, observability, DB performance, caching and CDN, queues, horizontal scale and statelessness, the database at scale, capacity/cost/SLOs, deploys and blast-radius control, and incidents/post-mortems.

Here's my context:
1. My current product/project is: [describe]
2. Current user/traffic state: [e.g. pre-launch / 50 DAU / 5k DAU / etc.]
3. My current stack: [language, DB, hosting]
4. Where I think the next bottleneck is: [my guess]

Given that, answer these specifically:
- Which module of this course should I prioritize, and why?
- Name 3 concrete things in my current state that would benefit most from this course.
- Name 1 thing the course will NOT help me with so I don't have wrong expectations.
- If I only have 2 hours this week, which single skill from the syllabus gives me the biggest payoff, and how would I measure that it worked?

Incident response runbook pack

Pick a real (or fictional) service. Write a complete on-call package: SLO doc, top-5 alert rules each with annotations and runbook, an incident response template (declare/IC/comms/post-mortem), a kill-switch registry with at least 4 entries, and a quarterly chaos-drill schedule. Submit the pack: this is the artifact a new SRE would inherit.

Minimum rating for approval: 3/5
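An SLO doc and its alert rules usually rest on a small amount of error-budget arithmetic. The sketch below shows one common way to compute it; the 99.9% target, the 30-day window, and the function names are assumptions chosen for illustration, not values the course prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 uses up the whole budget exactly at the end of the window.
    """
    return observed_error_ratio / (1.0 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget spent after `hours` at a given burn rate."""
    return rate * hours / (window_days * 24)


# Illustrative numbers only: a 99.9% SLO over 30 days.
slo = 0.999
print(f"budget: {error_budget_minutes(slo):.1f} minutes per 30 days")
rate = burn_rate(observed_error_ratio=0.0144, slo=slo)
print(f"burn rate: {rate:.1f}x, budget spent in 1h: {budget_consumed(rate, 1):.1%}")
```

Alerting on burn rate rather than on raw error counts ties each page to the SLO it protects, which makes the alert's annotation and runbook easier to write.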

Perf overhaul: 5x latency improvement

Take a real (or scaffolded) app under realistic load. Find and fix three perf bottlenecks: one DB-level (index, N+1, query rewrite), one app-level (caching, projection, async), one infra-level (CDN, gzip, HTTP/2, keepalive). Submit before/after p99 latency, throughput, and cost-per-request — with the EXPLAIN plans and profiling output to back each up.

Minimum rating for approval: 3/5
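Before/after p99 numbers are only comparable if both are computed the same way. Here is a minimal sketch using the nearest-rank percentile definition; the sample data is synthetic and the helper names are hypothetical. Load-testing tools may use a different percentile definition, so state which one you used in the submission.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def speedup(before: Sequence[float], after: Sequence[float], p: float = 99) -> float:
    """Ratio of the before/after latency at the given percentile."""
    return percentile(before, p) / percentile(after, p)


# Synthetic latencies in milliseconds, for illustration only.
before_ms = [float(x) for x in range(1, 101)]
after_ms = [x / 5 for x in before_ms]
print(f"p99 before: {percentile(before_ms, 99)} ms")
print(f"p99 after:  {percentile(after_ms, 99)} ms")
print(f"speedup:    {speedup(before_ms, after_ms):.1f}x")
```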

Horizontal-scale rollout

Take a single-box (or sticky) app to 3+ instances behind a load balancer. Move every piece of state — sessions, uploads, locks, in-memory caches — out to shared infrastructure. Verify by killing instances mid-traffic with zero user-visible errors. Submit the diff, the LB config, the /readyz semantics, and chaos-test evidence.

Minimum rating for approval: 3/5
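One possible shape for the /readyz semantics, using only the Python standard library. This is a sketch under stated assumptions: the separate /healthz liveness path, the `check_database` placeholder, and the `draining` flag are hypothetical names introduced here, not something the course specifies. The idea is that a load balancer stops routing to an instance as soon as /readyz returns 503, while the process itself stays up.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict, Tuple

# Set this event (for example from a shutdown handler) to fail readiness on purpose.
draining = threading.Event()


def check_database() -> bool:
    """Hypothetical placeholder; a real check would run a cheap query with a timeout."""
    return True


CHECKS: Dict[str, Callable[[], bool]] = {"database": check_database}


def readiness(checks: Dict[str, Callable[[], bool]], is_draining: bool) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) describing whether this instance should get traffic."""
    if is_draining:
        return 503, {"status": "draining"}
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        return 503, {"status": "unready", "failed": sorted(failed)}
    return 200, {"status": "ready"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":  # liveness: the process is up, nothing more
            status, body = 200, {"status": "alive"}
        elif self.path == "/readyz":  # readiness: safe to route traffic here
            status, body = readiness(CHECKS, draining.is_set())
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocking; call this from your entry point."""
    ThreadingHTTPServer(("", port), Handler).serve_forever()
```

Keeping liveness and readiness separate means a failing dependency removes the instance from rotation without causing the orchestrator to restart it in a loop.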

Release-engineering overhaul

Take a deploy process that currently risks user-visible 5xx. Add: rolling deploys with /readyz semantics + drain handler, feature flags (with kill switches), a smoke-test gate in CI, a quarterly rollback drill, and a documented blue-green or canary plan for a specific class of risky changes. Submit before/after deploy timeline, evidence of zero-5xx deploys over the next 10 deploys, and the rollback drill record.

Minimum rating for approval: 3/5
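A feature flag with a kill switch can be quite small. The sketch below is an in-memory illustration only: the `Flag` type, the `new-checkout` flag name, and the hash-based bucketing are assumptions made for the example, and a real system would load flag state from shared configuration so that flipping a switch reaches every instance.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class Flag:
    rollout_percent: int = 0  # 0 to 100: share of users who see the feature
    killed: bool = False      # kill switch: overrides the rollout entirely


def bucket(flag_name: str, user_id: str) -> int:
    """Deterministically map (flag, user) to a bucket in 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flags: Dict[str, Flag], flag_name: str, user_id: str) -> bool:
    """Unknown or killed flags are off; otherwise roll out by stable user bucket."""
    flag = flags.get(flag_name)
    if flag is None or flag.killed:
        return False
    return bucket(flag_name, user_id) < flag.rollout_percent


# Hypothetical flag for illustration.
flags = {"new-checkout": Flag(rollout_percent=10)}
print(is_enabled(flags, "new-checkout", "user-42"))

flags["new-checkout"].killed = True  # flip the kill switch: off for everyone
print(is_enabled(flags, "new-checkout", "user-42"))
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while giving different flags independent rollout populations.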

Modules 8 and 10 pull directly from this. The chapters on monitoring, alerting on SLOs, and postmortem culture are required reading.

Release It! (Michael Nygard, 2nd ed.)

book

Stability patterns (circuit breakers, bulkheads, timeouts), capacity, deployment — the practitioner's companion to this course.

Scaling to 100k Users: The Playbook

Day 0: single server, real stack

First 100 users: observability before growth

First 1k users: the database cracks first

10k users: caching, replicas, CDN

Queues: move slow work off the request path

Horizontal scale: stateless, sessions, sticky bits

The database at scale

Capacity, cost, and on-call

Deploys & blast-radius control

100k users in production