Project · Build and profile a from-scratch generation loop

Module project · Difficulty: hard

Ship something real. Submit your work when you're done.

Brief

Run a small model locally and build a generation loop you control: implement greedy and sampled decoding, measure prefill latency and decode throughput (tokens/sec) separately, handle stop conditions, and produce a short report on where the time goes. This is the baseline you'll optimize in later modules.
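One possible shape for that loop is sketched below. It is not a reference solution: `toy_forward`, `EOS_ID`, and `VOCAB_SIZE` are hypothetical stand-ins so the example runs without a model, and you would swap in your real forward pass and tokenizer ids. The point is the structure: one timed call over the whole prompt (prefill), then a timed token-by-token loop (decode) with explicit stop conditions.

```python
import time

EOS_ID = 0        # hypothetical end-of-sequence token id
VOCAB_SIZE = 16   # hypothetical toy vocabulary size


def toy_forward(token_ids, cache):
    """Hypothetical stand-in for a model call.

    Appends token_ids to the cache (playing the role of a KV cache) and
    returns fake logits for the next token.
    """
    cache.extend(token_ids)
    peak = (len(cache) * 7) % VOCAB_SIZE
    return [1.0 if i == peak else 0.0 for i in range(VOCAB_SIZE)]


def generate_greedy(prompt_ids, max_new_tokens=32, stop_ids=(EOS_ID,)):
    cache = []

    # Prefill: the whole prompt goes through the model in one call.
    t0 = time.perf_counter()
    logits = toy_forward(prompt_ids, cache)
    t1 = time.perf_counter()

    # Decode: one new token per model call.
    tokens = []
    finish_reason = "length"
    while len(tokens) < max_new_tokens:
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        if next_id in stop_ids:
            finish_reason = "stop"
            break
        tokens.append(next_id)
        if len(tokens) == max_new_tokens:
            break  # skip a forward pass whose logits would never be used
        logits = toy_forward([next_id], cache)
    t2 = time.perf_counter()

    decode_s = t2 - t1
    return {
        "tokens": tokens,
        "finish_reason": finish_reason,
        "prefill_s": t1 - t0,
        "decode_s": decode_s,
        "decode_tok_per_s": len(tokens) / decode_s if decode_s > 0 else float("inf"),
    }


result = generate_greedy([3, 5, 2])
print(result["finish_reason"], len(result["tokens"]))
```

With a real model on a GPU, kernel launches are typically asynchronous, so you would likely need to synchronize (for example `torch.cuda.synchronize()` in PyTorch) before reading each timer; otherwise the prefill and decode numbers can bleed into each other.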

Deliverables

A local generation loop with both greedy and temperature/top-p sampling.
Separate measurements of prefill latency and decode tokens/sec, plus memory use.
A short report that maps your measurements to the prefill and decode phases and explains which phase dominates end-to-end latency.
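For the sampling deliverable, a minimal sketch of temperature plus top-p (nucleus) sampling over a plain list of logits might look like the following. The function name `sample_top_p` and its parameters are illustrative, and treating `temperature <= 0` as greedy is a convention assumed here, not a requirement of the project.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token id using temperature scaling and top-p filtering."""
    if temperature <= 0:
        # Assumed convention: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit renormalization.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
logits = [2.0, 1.0, 0.0, -1.0]
narrow = {sample_top_p(logits, temperature=1.0, top_p=0.5, rng=rng) for _ in range(200)}
wide = {sample_top_p(logits, temperature=5.0, top_p=1.0, rng=rng) for _ in range(200)}
print(len(narrow), len(wide))
```

Counting distinct outputs across repeated draws, as in the last lines, is one simple way to show that the parameters change diversity: a tight `top_p` collapses toward the greedy choice, while a high temperature spreads probability across more tokens.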

How we grade it

Sampling parameters demonstrably change output diversity/determinism.
Prefill and decode are measured separately, not lumped together.
The report correctly attributes latency to phases and identifies the dominant cost.
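To make the attribution concrete, the arithmetic behind the report can be sketched as below. The helper `summarize` is hypothetical, and the numbers passed to it are placeholders chosen only to illustrate the calculation; they are not measurements from any model.

```python
def summarize(prefill_s, decode_s, prompt_tokens, new_tokens):
    """Turn raw phase timings into the figures a report would cite."""
    total_s = prefill_s + decode_s
    return {
        "prefill_tok_per_s": prompt_tokens / prefill_s,
        "decode_tok_per_s": new_tokens / decode_s,
        "ms_per_decoded_token": 1000.0 * decode_s / new_tokens,
        "prefill_share": prefill_s / total_s,
        "dominant_phase": "prefill" if prefill_s > decode_s else "decode",
    }


# Placeholder inputs, NOT real measurements: a 500-token prompt prefilled in
# 0.25 s, followed by 100 generated tokens taking 4.0 s.
report = summarize(prefill_s=0.25, decode_s=4.0, prompt_tokens=500, new_tokens=100)
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting prefill in prompt tokens per second and decode in generated tokens per second keeps the two phases comparable without lumping them into a single end-to-end figure.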
