Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
The biggest change in this field since 2023 is that for a huge class of LLM applications, you never train a model at all. You call a hosted API, and the artifacts you actually iterate on are prompts, retrieval configs, tool definitions, and agent flows. That doesn't make MLOps obsolete — it relocates it. Experiment tracking now tracks prompt versions and eval scores instead of training runs; the model registry holds prompts and chains and fine-tunes; CI gates on eval-set regressions instead of accuracy curves. If you treat prompts and RAG configs as throwaway strings instead of versioned, tested artifacts, you get all the rot of ML with none of the discipline. LLMOps is MLOps applied to artifacts that aren't model weights.
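To make "CI gates on eval-set regressions" concrete, here is a minimal sketch of such a gate. The names `EvalResult`, `gate`, and `max_drop`, the version labels, and the scores are all hypothetical, chosen for illustration. A real pipeline would compute the scores by running the golden eval set against each prompt version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    prompt_version: str  # hypothetical version label, e.g. a git tag
    score: float         # mean score on the golden eval set, 0.0 to 1.0


def gate(baseline: EvalResult, candidate: EvalResult, max_drop: float = 0.02) -> bool:
    """Return True if the candidate's score has not dropped more than
    max_drop below the baseline's score."""
    return candidate.score >= baseline.score - max_drop


# Illustrative numbers only, not real measurements.
baseline = EvalResult("system_prompt_v2", 0.90)
candidate = EvalResult("system_prompt_v3", 0.85)
print("ship candidate?", gate(baseline, candidate))
```

With these made-up scores the candidate is blocked, because it falls more than `max_drop` below the baseline. The tolerance is a design choice: LLM-as-judge scores are noisy, so a small allowed drop avoids failing the build on noise alone.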
The demo reframes the classic MLOps artifact list into its LLMOps equivalents, making concrete that a prompt is an experiment, a prompt+chain is a registry entry, and an eval set is your test suite.
mlops_to_llmops = {
    "training run": "prompt / chain / RAG-config experiment",
    "trained model weights": "versioned prompt template + tool defs + retriever config",
    "model registry entry": "registered prompt/chain/fine-tune with lineage",
    "accuracy on test set": "score on a golden eval set (incl. LLM-as-judge)",
    "retrain trigger": "prompt iteration or provider model update",
    "feature pipeline": "ingestion + chunking + embedding pipeline for the corpus",
}
for ml, llm in mlops_to_llmops.items():
    print(f"{ml:<24} -> {llm}")

# The lesson: prompts/configs are ARTIFACTS. Version them, test them, register them.
def is_versioned(artifact, in_git, has_eval):
    return in_git and has_eval

print("prompt under control?", is_versioned("system_prompt_v3", in_git=True, has_eval=False))

python3 main.py

Use these three in order. Each builds on the one before.
In one paragraph, explain how LLMOps differs from classic MLOps, like I'm new to it.
Walk me through which MLOps practices carry over directly to an LLM app that does no training, and what the new 'artifacts' are.
Given an LLM app built entirely on a hosted API (no training), design the minimal LLMOps discipline it still needs and justify each piece.
When the model call fails
Read the error and decide: fix the request, retry, or fall back.
400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), and 404 (wrong model id) are fatal: fix the request and don't retry. 429, 500/502/503, Anthropic 529 (overloaded), and timeouts are transient: retry with backoff. Watch for non-HTTP failures too: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects. Always log the provider request id with the error so you can trace it later.
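The fatal/transient split above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not any provider SDK's API: `ProviderError`, `classify`, and `call_with_retry` are hypothetical names, and real client libraries raise their own exception types that you would map onto this shape. Status codes not named in the text are treated as non-retryable here, which is a conservative assumption.

```python
import random
import time

FATAL = {400, 401, 403, 404, 422}      # fix the request, do not retry
TRANSIENT = {429, 500, 502, 503, 529}  # retry with backoff


class ProviderError(Exception):
    """Hypothetical error carrying an HTTP status and the provider request id."""

    def __init__(self, status, request_id=None):
        super().__init__(f"status={status} request_id={request_id}")
        self.status = status
        self.request_id = request_id


def classify(status):
    """Map a status code to 'fatal', 'transient', or 'unknown'."""
    if status in FATAL:
        return "fatal"
    if status in TRANSIENT:
        return "transient"
    return "unknown"


def call_with_retry(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(); retry transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (ProviderError, TimeoutError) as err:
            status = getattr(err, "status", None)
            request_id = getattr(err, "request_id", None)
            # Log the request id with every failure so it can be traced later.
            print(f"attempt {attempt} failed: status={status} request_id={request_id}")
            transient = isinstance(err, TimeoutError) or classify(status) == "transient"
            if not transient or attempt == max_attempts:
                raise  # fatal, unknown, or out of attempts
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))


# Usage: a fake call that is overloaded, then rate-limited, then succeeds.
outcomes = iter([ProviderError(529, "req_1"), ProviderError(429, "req_2"), "ok"])


def flaky_call():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(call_with_retry(flaky_call, sleep=lambda seconds: None))
```

The `sleep` parameter is injectable only so the example runs instantly. The non-HTTP failures (truncation, refusals, malformed JSON) need their own checks on the response body and are not covered by this sketch.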