The MLOps → LLMOps shift

medium

Why this matters

The biggest change in this field since 2023 is that for a huge class of LLM applications, you never train a model at all. You call a hosted API, and the artifacts you actually iterate on are prompts, retrieval configs, tool definitions, and agent flows. That doesn't make MLOps obsolete; it relocates it. Experiment tracking now records prompt versions and eval scores instead of training runs; the model registry holds prompts, chains, and fine-tunes; CI gates on eval-set regressions instead of accuracy curves. If you treat prompts and RAG configs as throwaway strings instead of versioned, tested artifacts, you get all the rot of ML with none of the discipline. LLMOps is MLOps applied to artifacts that aren't model weights.
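One way to picture "CI gates on eval-set regressions" is a small check that fails the build when a candidate prompt scores noticeably below the last accepted baseline. This is only a minimal sketch: `eval_gate`, the example scores, and the 0.02 tolerance are hypothetical and not part of the lesson's demo.

```python
def eval_gate(baseline_score: float, candidate_score: float, tolerance: float = 0.02) -> bool:
    """Return True if the candidate has not regressed by more than `tolerance`."""
    return candidate_score >= baseline_score - tolerance


if __name__ == "__main__":
    # Illustrative numbers only: each score is the fraction of eval cases passed.
    baseline = 0.90   # prompt version currently accepted
    candidate = 0.85  # prompt version under review
    if eval_gate(baseline, candidate):
        print("gate passed: candidate may ship")
    else:
        print("gate failed: eval-set regression")
```

In a real pipeline the failing branch would typically exit non-zero so the CI job fails; how much tolerance to allow depends on how noisy your eval scores are.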

Demo

The demo maps each classic MLOps artifact to its LLMOps equivalent, so you can see concretely that a prompt change is an experiment, a prompt plus its chain is a registry entry, and an eval set is your test suite.

mlops_to_llmops = {
    "training run":          "prompt / chain / RAG-config experiment",
    "trained model weights": "versioned prompt template + tool defs + retriever config",
    "model registry entry":  "registered prompt/chain/fine-tune with lineage",
    "accuracy on test set":  "score on a golden eval set (incl. LLM-as-judge)",
    "retrain trigger":       "prompt iteration or provider model update",
    "feature pipeline":      "ingestion + chunking + embedding pipeline for the corpus",
}
for ml, llm in mlops_to_llmops.items():
    print(f"{ml:<24} ->  {llm}")

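# The lesson: prompts/configs are ARTIFACTS. Version them, test them, register them.
def is_versioned(artifact, in_git, has_eval):
    """Return True only if the artifact is both tracked in git and covered by an eval."""
    # `artifact` is just a label here; the result depends only on the two flags.
    return in_git and has_eval
# Prints "prompt under control? False": in git, but no eval yet.
print("prompt under control?", is_versioned("system_prompt_v3", in_git=True, has_eval=False))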

Run: python3 main.py
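The "registered prompt/chain/fine-tune with lineage" row can be sketched as a small record whose version is a hash of its contents, with a pointer back to the version it was derived from. Everything here is an assumption made for illustration: `PromptArtifact`, its fields, the `example-model-1` id, and the 12-character hash are not from the lesson or from any particular registry tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PromptArtifact:
    """Hypothetical registry entry: prompt template plus the config it ran with."""
    name: str
    template: str
    model_id: str                                   # provider model it was evaluated against
    retriever_config: dict = field(default_factory=dict)
    parent_version: Optional[str] = None            # lineage: version this one was derived from

    def version(self) -> str:
        # Hash every field so that any change yields a new version id.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


v1 = PromptArtifact(
    name="support_answer",
    template="Answer using only the context:\n{context}\n\nQ: {question}",
    model_id="example-model-1",
    retriever_config={"top_k": 4, "chunk_size": 512},
)
v2 = PromptArtifact(
    name="support_answer",
    template="Answer briefly, using only the context:\n{context}\n\nQ: {question}",
    model_id="example-model-1",
    retriever_config={"top_k": 4, "chunk_size": 512},
    parent_version=v1.version(),
)
print(v1.version(), "->", v2.version())
```

Hashing the content means two entries with the same template, model, and retriever config always get the same id, so "which prompt was live?" has an unambiguous answer.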

Try it yourself

Map your own app's components onto the right column — which of your prompts/configs are versioned with evals?

Find one prompt in your codebase that lives as an inline string with no version and no eval — that's the gap.

Decide for your app: do you train/fine-tune anything, or is every 'model' a hosted API call?

Run is_versioned with has_eval=False and note it returns False — a prompt without an eval is not under control.
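If you find a prompt with no eval, the smallest useful fix is a golden set plus a scoring loop. The sketch below assumes a plain substring check and uses `fake_generate` as an offline stand-in for your hosted model call; `GOLDEN_SET`, `run_eval`, and the example questions are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical golden set: each case pairs an input with text the answer must contain.
GOLDEN_SET: List[Dict[str, str]] = [
    {"input": "What is the refund window?", "must_contain": "30 days"},
    {"input": "Which plan includes SSO?", "must_contain": "enterprise"},
]


def run_eval(generate: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in generate(case["input"]).lower()
    )
    return passed / len(cases)


def fake_generate(question: str) -> str:
    """Stand-in for a hosted model call so the sketch runs offline."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is available on the Team plan.",
    }
    return canned.get(question, "")


score = run_eval(fake_generate, GOLDEN_SET)
print(f"golden-set score: {score:.2f}")  # prints "golden-set score: 0.50"
```

A substring check is the crudest possible grader; the same loop works if you swap it for an exact-match check or an LLM-as-judge call.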

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain how LLMOps differs from classic MLOps, like I'm new to it.

2. Why it works (the mechanism)

Walk me through which MLOps practices carry over directly to an LLM app that does no training, and what the new 'artifacts' are.

3. Advanced — application & what's next

Given an LLM app built entirely on a hosted API (no training), design the minimal LLMOps discipline it still needs and justify each piece.

References

When the model call fails

Read the error and decide: fix the request, retry, or fall back.

Fatal (fix the request, don't retry): 400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), 404 (wrong model id).

Transient (retry with backoff): 429, 500/502/503, Anthropic 529 (overloaded), and timeouts.

Non-HTTP failures to watch for: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects.

Always log the provider request id with the error so you can trace it later.
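The fatal-versus-transient split above can be sketched as a retry wrapper with exponential backoff and jitter. This is a minimal sketch under stated assumptions: `ModelCallError` is a hypothetical exception (real SDKs raise their own types), a status of `None` stands for a timeout, and any status not listed as transient is treated as fatal.

```python
import logging
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Transient statuses from the list above; 400/401/403/404/422 are fatal.
TRANSIENT_STATUSES = {429, 500, 502, 503, 529}


class ModelCallError(Exception):
    """Hypothetical error type carrying the HTTP status and provider request id."""

    def __init__(self, status: Optional[int], request_id: Optional[str] = None):
        super().__init__(f"model call failed (status={status}, request_id={request_id})")
        self.status = status          # None stands for a timeout
        self.request_id = request_id


def is_transient(status: Optional[int]) -> bool:
    # Assumption: anything not listed as transient is fatal and is never retried.
    return status is None or status in TRANSIENT_STATUSES


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ModelCallError as err:
            # Log the request id with every failure so it can be traced later.
            logging.warning(
                "attempt %d failed: status=%s request_id=%s",
                attempt, err.status, err.request_id,
            )
            if not is_transient(err.status) or attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
```

The caller decides what happens once the error is re-raised: fix the request for a fatal status, or fall back to another model when retries run out. The `sleep` parameter exists only so tests can skip the real delay.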