Fine-tuning, quantization, and production deployment of large language models. For engineers who want to go beyond API calls.
Go beyond prompting and RAG. This course covers the full model adaptation lifecycle: LoRA and QLoRA fine-tuning from first principles, SFT pipelines with TRL, preference optimization with DPO and RLHF, quantization schemes (int8, int4, GPTQ, AWQ), production serving with vLLM and TGI, distributed training across multiple GPUs, custom training objectives, evaluation suites that catch regressions before they ship, multi-model deployment infrastructure, and safety and alignment engineering. Every module has working Python code (no pseudocode, no toy examples) and a real project. The capstone is a fine-tuned, quantized, evaluated, and deployed model you built from scratch.
Built by Lakshya Kumar
We grant free access case by case to students, career-switchers, and builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Pick a domain (customer support, code review, legal summarization, medical Q&A, or similar). Fine-tune a 7B model with LoRA or QLoRA using TRL. Run DPO on top with 200+ preference pairs. Quantize to int4 with AWQ. Evaluate with a custom 50-example eval set showing ≥70% win rate vs base model. Deploy behind a vLLM server with a FastAPI gateway that handles auth, rate limiting, and structured logging. Ship as a GitHub repo with README, model card (training procedure, eval results, known limitations), and a working Docker setup.
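As a rough illustration of the "≥70% win rate vs base model" bar, the sketch below scores a 50-example eval set from pairwise judgments. The label names (`tuned`, `base`, `tie`), the choice to count a tie as half a win, and the sample counts are all assumptions made for this example; the course does not specify how ties are handled.

```python
import math
from collections import Counter

VALID_LABELS = {"tuned", "base", "tie"}  # hypothetical label names


def win_rate(judgments, tie_weight=0.5):
    """Fraction of comparisons won by the fine-tuned model.

    Each judgment is 'tuned', 'base', or 'tie'. Counting a tie as half
    a win (tie_weight=0.5) is an assumption, not a course requirement.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return (counts["tuned"] + tie_weight * counts["tie"]) / len(judgments)


def standard_error(rate, n):
    """Rough binomial standard error of a win rate measured on n examples."""
    return math.sqrt(rate * (1 - rate) / n)


# Made-up results for a 50-example eval set.
judgments = ["tuned"] * 34 + ["base"] * 12 + ["tie"] * 4
rate = win_rate(judgments)
meets_bar = rate >= 0.70
print(f"win rate {rate:.2f} +/- {standard_error(rate, len(judgments)):.2f}, passes: {meets_bar}")
```

With only 50 examples the standard error is around 6 percentage points, so a result sitting right at the threshold is worth re-checking on a second sample.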
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
I'm considering 'Advanced AI Engineering', a course on fine-tuning, quantization, and production deployment of large language models. 100 Python challenges: LoRA from scratch, QLoRA memory math, TRL SFT pipelines, DPO and RLHF, quantization (int8/int4/GPTQ/AWQ), vLLM/TGI serving, distributed training, custom objectives, eval suites, multi-model deployment, and safety engineering.

Context about me:
1. My current AI work: [e.g. "I call OpenAI APIs at work", "I've fine-tuned BERT for classification", "I did the LLM from Scratch course", "I run models locally with Ollama"]
2. My GPU access: [e.g. "only Colab T4", "personal 4090", "work A100 cluster", "cloud spot instances"]
3. What I want to be able to do: [e.g. "fine-tune Llama for my company's domain", "reduce serving costs by 10×", "build an AI product that doesn't depend on OpenAI", "run models on-premise for privacy"]

Answer:
- Which 2 modules give me the highest leverage in the next 3 months?
- What concrete artifact will I build that proves I can do this work?
- Is this course right for me or should I do 'Applied AI: From ML to Modern Systems' first?
- What will I NOT be able to do after this, e.g. "pre-train a model from scratch", "match OpenAI's safety team", "build GPT-4 level performance"?
Run an RLHF (or DPO) pipeline on a small open-source model (e.g., Llama 3.2 1B). Generate preference data, train a reward model (or use direct preference optimization), fine-tune the base model, and evaluate before/after on alignment + capability benchmarks.
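For the direct preference optimization branch of this project, the per-pair objective can be sketched in plain Python. This is a minimal illustration of the standard DPO loss on one preference pair, not the course's implementation: the function name, argument names, and `beta=0.1` default are assumptions, and a real pipeline would compute the sequence log-probabilities from the policy and a frozen reference model in batches.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    beta=0.1 is an illustrative default, not a recommendation.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only depends on how far the policy has moved relative to the reference, which is why no separate reward model is needed on this path.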
Implement speculative decoding (draft + verify) with a small draft model and large target model. Measure throughput gain vs vanilla decoding on a 200-token generation benchmark. Identify when speculation helps and when overhead dominates.
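Before running the benchmark, it can help to estimate when speculation should pay off. The sketch below uses the commonly cited analytical model from the speculative decoding literature (Leviathan et al., 2023), under its simplifying assumptions: each drafted token is accepted independently with probability `alpha`, the draft model proposes `gamma` tokens per step, and `cost_ratio` is the time of one draft forward pass divided by one target forward pass. The function names and the sample numbers are made up for illustration; measured acceptance rates will vary by prompt and model pair.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per draft-then-verify step.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over vanilla decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Illustrative numbers only: a cheap, well-matched draft model helps...
good = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
# ...while an expensive, poorly matched one makes decoding slower.
bad = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.3)
print(f"good draft: {good:.2f}x, bad draft: {bad:.2f}x")
```

A value below 1.0 is the "overhead dominates" regime: the draft passes cost more than the accepted tokens save.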
Take a 4k-context model and extend its effective context to 32k via RoPE scaling (linear or YaRN). Validate on the needle-in-haystack benchmark at 8k, 16k, 32k. Document the degradation curve and the practical context limit.
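The linear variant of RoPE scaling (often called position interpolation) can be sketched in a few lines: positions are divided by the extension factor so that a 32k sequence maps back into the 4k range the model saw during training. The function below is a simplified illustration with assumed names and a conventional `base=10000.0`; it computes rotation angles only, and does not cover YaRN, which rescales frequencies non-uniformly.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    scale=1.0 is the unmodified scheme. With linear scaling, the
    position is divided by `scale` before the angles are computed.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


# Extending 4k -> 32k means a scale factor of 8 (an assumed setup).
scale = 32768 / 4096
extended = rope_angles(32760, head_dim=128, scale=scale)
original = rope_angles(4095, head_dim=128)

# Position 32760 with scaling lands on the same angles as position 4095
# without it, so the model never sees angles outside its trained range.
print(all(math.isclose(a, b) for a, b in zip(extended, original)))
```

The trade-off is that neighbouring tokens now sit 8 times closer together in angle space, which is one reason quality tends to degrade as the factor grows and why the needle-in-haystack check at each length matters.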
Take a fine-tuned model and produce 4 deployable variants: bf16 baseline, INT8 quantized, INT4 (AWQ or GPTQ), and llama.cpp Q4_K_M. Benchmark each for latency (P50/P99), throughput (tok/s), quality (3 benchmark suites), and cost per million tokens. Pick the winner with justification.
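The latency and cost columns of this comparison reduce to a little arithmetic, sketched below. The helper names, the nearest-rank percentile method, and every number in the sample data are assumptions for illustration, not measurements; the cost figure also assumes a single GPU billed by the hour and kept fully busy.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a list of numbers."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Made-up request latencies in milliseconds for one variant.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Made-up pricing and throughput: $2.00/hour GPU at 1000 tok/s.
cost = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000)
print(f"P50={p50} ms, P99={p99} ms, cost=${cost:.3f} per 1M tokens")
```

Running the same helpers over each of the four variants gives a table where the quality scores can be weighed directly against P99 latency and cost.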
The QLoRA paper. Read before Module 2.