Hands-on, single-GPU fine-tuning: adapt an open model to your task with LoRA/QLoRA, evaluate it honestly against the base, then quantize and self-host it behind your own OpenAI-compatible API. Free-tier and Colab friendly throughout.
The practical, builder-focused path from a hosted LLM to your own fine-tuned model running on hardware you control. You'll decide when fine-tuning is even the right tool, pick a base model and license, build a clean dataset, and run a real QLoRA fine-tune on a single (free) GPU — then evaluate it rigorously, quantize it for deployment, and serve it behind an OpenAI-compatible API with Ollama, vLLM, or TGI. Python-first with Hugging Face TRL/PEFT, bitsandbytes, llama.cpp, and Ollama; every lab is designed to run on a free Colab/Kaggle GPU. The course ends with the production and iteration discipline — monitoring, versioning, lineage, and a data flywheel — that keeps a self-hosted fine-tune competitive over time.
Built by Lakshya Kumar
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Sign in to applyComplete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Take an open base model and a custom dataset you built, fine-tune it with LoRA/QLoRA on a single GPU, evaluate it against the best-prompted base on a frozen held-out set, quantize it for deployment, and self-host it behind an OpenAI-compatible API. Submit the adapter, a before/after eval report, the quantized artifact, and the running endpoint with a swap into a small app.
I'm taking a hands-on, single-GPU "Fine-Tuning & Self-Hosting LLMs" course. It covers: deciding when to fine-tune vs. prompt vs. RAG, picking an open base model and checking its license, building a clean dataset, LoRA/QLoRA training on a free Colab/Kaggle GPU with Hugging Face TRL/PEFT, evaluating against the base, quantizing (GGUF/AWQ/GPTQ), and self-hosting behind an OpenAI-compatible API with Ollama/vLLM/TGI, plus production and iteration practices. Here is my situation: - My task / what I want the model to do better: [describe the behavior or format] - My base model (or "help me pick"): [e.g. Qwen2.5-7B-Instruct] - My GPU / where I'll train: [e.g. free Colab T4, 16GB] - My dataset (or what I have to build one from): [describe source + rough size] - Where I'll serve it: [e.g. Ollama on CPU / vLLM on a GPU VM] Act as my fine-tuning coach. First sanity-check whether fine-tuning is even the right tool for my task (vs. a better prompt or RAG). If it is, walk me through the smallest viable LoRA/QLoRA setup that fits my GPU, what dataset to build and how to verify it, how to evaluate against the best-prompted base, which quantization/format to use for my serving target, and how to self-host it. Flag the most likely failure (OOM, overfitting, wrong format, license issue) for my specific situation and how to avoid it.
Build a clean, leakage-free, versioned fine-tuning dataset for a real task from a raw source: clean, dedupe, format as chat messages, split train/val, and optionally augment with filtered synthetic data. Submit the JSONL dataset, metadata/provenance, and a quality report (clean fraction, dedup stats, leakage check, data-scaling note).
Run a complete QLoRA fine-tune on a single (free) GPU and document it: hyperparameter choices (rank/alpha/targets, batch/accumulation/LR), memory budget vs. your GPU, loss curves, any OOM fixes applied, and the saved adapter. Submit the training log, curves, config, and a writeup justifying each decision.
Merge an adapter and quantize the model across at least three levels/formats (e.g. fp16, GGUF q4/q5, AWQ), then measure the size-vs-quality trade-off on a held-out eval and pick a deployable variant. Submit the size-vs-quality table, the chosen format with justification for a stated runtime, and a verification report.
Self-host a model behind an OpenAI-compatible server, containerize it, and swap it into an app that previously used a hosted API — behind a config flag with rollback. Submit the Dockerized server, endpoint tests, a documented capacity + cost-per-token number, and evidence the app's behavior held after the swap.
SFTTrainer is the workhorse of this course's training runs. Read the SFT section before Module 5.