Machine Learning & AI / Course

Fine-Tuning & Self-Hosting LLMs

Hands-on, single-GPU fine-tuning: adapt an open model to your task with LoRA/QLoRA, evaluate it honestly against the base, then quantize and self-host it behind your own OpenAI-compatible API. Free-tier and Colab friendly throughout.

Free preview

Certificate: 1 of 5 capstones

The practical, builder-focused path from a hosted LLM to your own fine-tuned model running on hardware you control. You'll decide when fine-tuning is even the right tool, pick a base model and license, build a clean dataset, and run a real QLoRA fine-tune on a single (free) GPU — then evaluate it rigorously, quantize it for deployment, and serve it behind an OpenAI-compatible API with Ollama, vLLM, or TGI. Python-first with Hugging Face TRL/PEFT, bitsandbytes, llama.cpp, and Ollama; every lab is designed to run on a free Colab/Kaggle GPU. The course ends with the production and iteration discipline — monitoring, versioning, lineage, and a data flywheel — that keeps a self-hosted fine-tune competitive over time.

Built by Lakshya Kumar

fine-tuning

lora

qlora

self-hosting

quantization

llm

ollama

Before you start4 items

Comfortable in Python and have used the Hugging Face ecosystem or are willing to learn it here.
Access to a single GPU (a free Colab/Kaggle GPU is enough for the labs).
Have called an LLM API before and understand prompting basics.
Basic command line + Docker familiarity for the self-hosting modules.

Is this course for you?Ask an AI

Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.

Get access to Fine-Tuning & Self-Hosting LLMs

$3.99

30-day access

Prefer the whole catalog? See all-access membership.

Ask for access

We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.

Capstone projects

Submit any 1 of 5 to earn the certificate

Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.

capstoneFine-tune, evaluate, quantize, and self-host an open model

Take an open base model and a custom dataset you built, fine-tune it with LoRA/QLoRA on a single GPU, evaluate it against the best-prompted base on a frozen held-out set, quantize it for deployment, and self-host it behind an OpenAI-compatible API. Submit the adapter, a before/after eval report, the quantized artifact, and the running endpoint with a swap into a small app.

Submit your fine-tuned, self-hosted modelMinimum rating for approval: 3/5

dataset-and-quality-reportDataset build + quality report

Further reading & study material6 sources

Prompt

I'm taking a hands-on, single-GPU "Fine-Tuning & Self-Hosting LLMs" course. It covers: deciding when to fine-tune vs. prompt vs. RAG, picking an open base model and checking its license, building a clean dataset, LoRA/QLoRA training on a free Colab/Kaggle GPU with Hugging Face TRL/PEFT, evaluating against the base, quantizing (GGUF/AWQ/GPTQ), and self-hosting behind an OpenAI-compatible API with Ollama/vLLM/TGI, plus production and iteration practices.

Here is my situation:
- My task / what I want the model to do better: [describe the behavior or format]
- My base model (or "help me pick"): [e.g. Qwen2.5-7B-Instruct]
- My GPU / where I'll train: [e.g. free Colab T4, 16GB]
- My dataset (or what I have to build one from): [describe source + rough size]
- Where I'll serve it: [e.g. Ollama on CPU / vLLM on a GPU VM]

Act as my fine-tuning coach. First sanity-check whether fine-tuning is even the right tool for my task (vs. a better prompt or RAG). If it is, walk me through the smallest viable LoRA/QLoRA setup that fits my GPU, what dataset to build and how to verify it, how to evaluate against the best-prompted base, which quantization/format to use for my serving target, and how to self-host it. Flag the most likely failure (OOM, overfitting, wrong format, license issue) for my specific situation and how to avoid it.

Build a clean, leakage-free, versioned fine-tuning dataset for a real task from a raw source: clean, dedupe, format as chat messages, split train/val, and optionally augment with filtered synthetic data. Submit the JSONL dataset, metadata/provenance, and a quality report (clean fraction, dedup stats, leakage check, data-scaling note).

Submit dataset + reportMinimum rating for approval: 3/5

qlora-training-writeupQLoRA training run + writeup

Run a complete QLoRA fine-tune on a single (free) GPU and document it: hyperparameter choices (rank/alpha/targets, batch/accumulation/LR), memory budget vs. your GPU, loss curves, any OOM fixes applied, and the saved adapter. Submit the training log, curves, config, and a writeup justifying each decision.

Submit training run + writeupMinimum rating for approval: 3/5

quantization-quality-studyQuantization quality/size study

Merge an adapter and quantize the model across at least three levels/formats (e.g. fp16, GGUF q4/q5, AWQ), then measure the size-vs-quality trade-off on a held-out eval and pick a deployable variant. Submit the size-vs-quality table, the chosen format with justification for a stated runtime, and a verification report.

Submit quantization studyMinimum rating for approval: 3/5

self-hosted-replacementSelf-hosted deployment replacing a hosted API

Self-host a model behind an OpenAI-compatible server, containerize it, and swap it into an app that previously used a hosted API — behind a config flag with rollback. Submit the Dockerized server, endpoint tests, a documented capacity + cost-per-token number, and evidence the app's behavior held after the swap.

Submit self-hosted deploymentMinimum rating for approval: 3/5

SFTTrainer is the workhorse of this course's training runs. Read the SFT section before Module 5.

Fine-Tuning & Self-Hosting LLMs

When to Fine-Tune

Picking a Base Model & License

Building a Fine-Tuning Dataset

LoRA & QLoRA

Your First Fine-Tune

Evaluating the Fine-Tune

Merging & Quantizing for Deployment

Self-Hosting Your Model

Putting It in Production

Iterating