Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Every time the OS switches from one thread to another, it pays a tax that your application feels but rarely sees attributed in a profile: saving and restoring 40+ registers, flushing the TLB on some architectures, and, more critically, losing the L1/L2 cache lines the previous thread was actively working with. The switch itself takes 1–5 µs on Linux, but the cache-miss cascade that follows can add up to 100 µs of effective slowdown per switch. At 100,000 context switches per second, a rate easily reached on a thread-per-request server under high load, the direct cost alone is 0.1–0.5 s of CPU time per second, and an average cache penalty of just 10 µs per switch is enough on its own to consume a full CPU core on overhead that produces zero user-visible work. This is one reason event-loop servers like nginx outperform Apache's thread-per-connection model at high concurrency, even on the same hardware.
Rough numbers on a modern Linux box: at roughly 1 µs of direct cost per switch, 1M context switches per second add up to a full CPU core burned on overhead. This is why event-loop servers like nginx and Node.js can outperform thread-per-request servers under high concurrency.
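The arithmetic behind that claim is simple enough to check directly. The sketch below is only an illustration: the function name core_fraction is hypothetical, and the per-switch costs are the rough 1–5 µs direct-cost figures quoted above, not measurements.

```python
def core_fraction(switches_per_sec: float, cost_us: float) -> float:
    """Fraction of one CPU core spent on switching (1.0 means a full core).

    Hypothetical helper: multiplies the switch rate by an assumed
    per-switch cost in microseconds and converts to seconds per second.
    """
    return switches_per_sec * cost_us * 1e-6


if __name__ == "__main__":
    # Assumed direct costs of 1 and 5 microseconds per switch.
    for rate in (100_000, 1_000_000):
        for cost_us in (1, 5):
            frac = core_fraction(rate, cost_us)
            print(f"{rate:>9,} switches/s x {cost_us} us = {frac:.2f} cores")
```

Plugging in a larger per-switch cost to account for cache refill shows how quickly the budget disappears at much lower switch rates.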
Run vmstat 1 alongside a thread-heavy benchmark. Watch the cs (context switch) column climb. Record the rate at peak load.
Run perf stat -e context-switches (Linux) or sudo dtrace -n 'sched:::off-cpu { @[execname] = count(); }' (macOS) to count context switches for a simple program with 100 threads sleeping in a tight loop.
Use these three prompts in order. Each builds on the one before.
How expensive is a thread context switch on Linux, and what exactly happens during one?
Explain why L1/L2 cache-miss penalties often dominate the measured cost of a context switch, even more than the kernel overhead itself.
Design a micro-benchmark that isolates context-switch cost from the work done between switches. How do I avoid accidentally measuring something else?
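For the try-it exercises, here is a minimal sketch of a program with 100 threads sleeping in a tight loop. It can be run under perf stat or next to vmstat 1, and it also reports its own counts through getrusage. The names sleeper and run_workload, and the default thread count, duration, and sleep interval, are assumptions chosen for illustration. The resource module is Unix-only, and Python's interpreter lock adds switching of its own, so treat the output as a rough signal rather than a clean measurement of switch cost.

```python
import resource
import threading
import time


def sleeper(stop: threading.Event, interval_s: float) -> None:
    # Each sleep blocks the thread, which normally causes a voluntary switch.
    while not stop.is_set():
        time.sleep(interval_s)


def run_workload(n_threads: int = 100, duration_s: float = 2.0,
                 interval_s: float = 0.001) -> dict:
    """Run n_threads sleepers and report this process's context switches."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=sleeper, args=(stop, interval_s))
        for _ in range(n_threads)
    ]

    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)

    voluntary = after.ru_nvcsw - before.ru_nvcsw      # thread blocked or yielded
    involuntary = after.ru_nivcsw - before.ru_nivcsw  # thread was preempted
    return {
        "elapsed_s": elapsed,
        "voluntary": voluntary,
        "involuntary": involuntary,
        "switches_per_sec": (voluntary + involuntary) / elapsed,
    }


if __name__ == "__main__":
    stats = run_workload()
    print(f"voluntary:    {stats['voluntary']}")
    print(f"involuntary:  {stats['involuntary']}")
    print(f"switches/sec: {stats['switches_per_sec']:.0f}")
```

Comparing the voluntary and involuntary counts is a useful first check: a sleep-heavy workload like this one should be dominated by voluntary switches, while a CPU-bound one oversubscribed on cores would show more involuntary ones.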