Evaluating AI systems — BLEU, ROUGE, LLM-as-judge, human eval

hard

Why this matters

You can't improve what you can't measure, and AI systems are notoriously hard to measure. BLEU and ROUGE correlate weakly with human judgement on open-ended generation; human evaluation is expensive and slow to iterate on; and exact-match metrics such as accuracy don't apply to free-form text. LLM-as-judge (having a strong model, usually stronger than the one being evaluated, score its outputs) has emerged as a cheap option that tends to track human preference more closely than n-gram metrics, but it has biases of its own (position bias, length bias, self-preference). Knowing which metric fits which task, and how to build an eval harness you can run on every code change, is what separates teams that iterate on AI quality from teams that ship and pray.

Demo

ROUGE and BLEU measure n-gram overlap between a candidate and a reference. That makes them fast and reproducible but blind to paraphrase and meaning: a summary that correctly rephrases everything can score near zero. LLM-as-judge instead asks a strong model to rate outputs against structured criteria such as accuracy, completeness, and conciseness, which generally correlates better with human judgement on open-ended generation tasks.

# pip install rouge-score nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
# No NLTK data download is needed: tokenization below uses str.split().

reference = "The transformer model uses attention to process all tokens in parallel, enabling long-range dependencies."
hypothesis_good = "Transformers use attention mechanisms to process tokens simultaneously and capture long-range context."
hypothesis_bad  = "The model is fast and works well for text tasks."

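# use_stemmer=True applies Porter stemming, so "uses" and "use" count as a match.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
# method1 smoothing avoids a zero BLEU when a higher-order n-gram has no matches.
smooth = SmoothingFunction().method1

for name, hyp in [("good", hypothesis_good), ("bad", hypothesis_bad)]:
    # RougeScorer.score takes (target, prediction), i.e. the reference first.
    r = scorer.score(reference, hyp)
    # sentence_bleu expects a list of tokenized references and one tokenized hypothesis.
    ref_tok = [reference.lower().split()]
    hyp_tok = hyp.lower().split()
    bleu = sentence_bleu(ref_tok, hyp_tok, smoothing_function=smooth)
    print(f"\n{name} hypothesis:")
    print(f"  ROUGE-1 F: {r['rouge1'].fmeasure:.3f}")
    print(f"  ROUGE-2 F: {r['rouge2'].fmeasure:.3f}")
    print(f"  ROUGE-L F: {r['rougeL'].fmeasure:.3f}")
    print(f"  BLEU:      {bleu:.3f}")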

# LLM-as-judge prompt pattern
judge_prompt = """You are a strict but fair judge. Rate the following summary on a scale of 1-5.

Reference: {reference}
Summary: {hypothesis}

Criteria:
- Factual accuracy (does it say anything wrong?)
- Completeness (does it cover the main point?)
- Conciseness (is it free of unnecessary words?)

Output JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

print("\nLLM-as-judge prompt (send to any LLM):")
print(judge_prompt.format(reference=reference, hypothesis=hypothesis_good))

Run: python3 main.py
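
To make "n-gram overlap" concrete, here is a minimal, dependency-free sketch of ROUGE-N computed from clipped n-gram counts. It is an illustration, not the rouge-score implementation: the tokenizer is a simplified assumption, there is no stemming, and the example sentences are made up for the demonstration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep only alphanumeric runs (an assumed, simplified tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples so they can be counted.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, hypothesis, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    ref_counts = Counter(ngrams(tokenize(reference), n))
    hyp_counts = Counter(ngrams(tokenize(hypothesis), n))
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref_counts & hyp_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative sentences, not taken from the demo above.
reference = "The cat sat on the mat."
paraphrase = "A feline rested upon a rug."   # same meaning, no shared words
copy_like = "The cat sat on the rug."        # different meaning, mostly shared words

print(rouge_n(reference, paraphrase))  # no shared words, so all three values are 0.0
print(rouge_n(reference, copy_like))
```

The paraphrase scores zero even though it preserves the meaning, while the near-copy scores high even though it changes a key word. That is the blind spot to keep in mind when reading the numbers from the demo.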

Try it yourself

Run ROUGE on a hypothesis that paraphrases the reference with synonyms so that few or no words are shared, e.g. replace 'parallel' with 'simultaneously' and 'long-range' with 'distant'. What happens to ROUGE-1? This exposes the limitation of n-gram overlap.

Write a hypothesis that is factually correct but very short: 'Transformers use attention.' Compute ROUGE and BLEU. Both come out low because the hypothesis is incomplete, not because it is wrong: ROUGE recall drops, and BLEU applies a brevity penalty. This is how automatic metrics can penalize compression.
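
Part of the reason a short hypothesis scores low on BLEU is the brevity penalty, which multiplies the geometric mean of the n-gram precisions. The sketch below computes only that penalty, using whitespace tokenization as in the demo; the function name `brevity_penalty` and the variable names are chosen here for illustration.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1.0 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "The transformer model uses attention to process all tokens in parallel, enabling long-range dependencies."
short_hyp = "Transformers use attention."

r = len(reference.split())  # whitespace tokens, as in the demo
c = len(short_hyp.split())
bp = brevity_penalty(c, r)
print(f"reference={r} tokens, candidate={c} tokens, brevity penalty={bp:.3f}")
```

With 3 candidate tokens against 14 reference tokens the penalty is roughly 0.026, so BLEU stays low however precise the three words are.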

Design an LLM-as-judge prompt for a customer support chatbot response. Your criteria: (1) resolves the customer's issue, (2) doesn't make promises that require manager approval, (3) maintains a professional tone. Write the prompt, then test it on two example responses.
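
When you test a judge prompt, the reply still has to be turned into a number you can trust. The sketch below assumes your prompt asks for the same JSON shape as the demo (`{"score": ..., "reason": ...}`) and shows one way to extract and validate it. `parse_judge_reply` is a hypothetical helper, and the replies are hand-written stand-ins for real model output; no LLM API is called.

```python
import json
import re


def parse_judge_reply(reply, min_score=1, max_score=5):
    """Extract {"score": int, "reason": str} from a judge reply, or return None if invalid."""
    # Judges often wrap the JSON in extra prose, so grab the outermost {...} span.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not min_score <= score <= max_score:
        return None
    return {"score": score, "reason": str(data.get("reason", ""))}


# Hypothetical replies standing in for real model output.
replies = [
    'Here is my rating: {"score": 4, "reason": "Accurate but slightly wordy."}',
    '{"score": 9, "reason": "Out of range."}',
    "I would give this a 3.",
]
parsed = [parse_judge_reply(reply) for reply in replies]
print(parsed)
```

Returning `None` for malformed or out-of-range replies lets the harness count judge failures separately instead of silently treating them as low scores.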

Research MT-Bench (a multi-turn benchmark for chat models). What tasks does it include? How does it use GPT-4 as a judge? Why is it more reliable than single-turn BLEU for evaluating instruction-following models?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what ROUGE measures and why it correlates poorly with human judgement for creative or open-ended text generation tasks. Give a concrete example where high ROUGE corresponds to a bad summary.

2. Why it works (the mechanism)

Walk me through LLM-as-judge evaluation: which model is typically used as the judge, which prompt structure works best, and how the three main bias types (position bias, length bias, self-preference) make LLM judges unreliable without mitigation.

3. Advanced — application & what's next

I'm building an eval harness for an AI coding assistant. I need to measure correctness (does the code run and pass tests?), quality (is the code idiomatic?), and safety (no hardcoded secrets, no shell injection). For each dimension, tell me which metric to use, how to compute it automatically, how to sample efficiently for human review (not 100% human eval), and how often to run the full eval.
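
As a starting point for comparing against whatever your AI suggests, here is a rough sketch of the correctness and safety parts of such a harness. Everything in it is an assumption for illustration: the two regex patterns are far from a complete safety scanner, `run_eval` and `safety_findings` are hypothetical names, and the `passed_tests` flags would in practice come from actually running a test suite.

```python
import re

# Illustrative patterns only; a real scanner would need a far broader rule set.
UNSAFE_PATTERNS = {
    "hardcoded_secret": re.compile(r"(?i)(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]"),
    "shell_injection_risk": re.compile(r"shell\s*=\s*True"),
}


def safety_findings(code):
    """Return the names of the patterns that match the generated code."""
    return [name for name, pattern in UNSAFE_PATTERNS.items() if pattern.search(code)]


def run_eval(cases, threshold):
    """cases: list of (generated_code, passed_tests) pairs. Returns a summary dict."""
    passed = sum(1 for _, ok in cases if ok)
    flagged = [code for code, _ in cases if safety_findings(code)]
    pass_rate = passed / len(cases) if cases else 0.0
    return {
        "pass_rate": pass_rate,
        "flagged": len(flagged),
        # The gate fails on a low pass rate or on any safety finding.
        "gate_ok": pass_rate >= threshold and not flagged,
    }


# Hypothetical outputs; `passed_tests` would come from running a real test suite.
cases = [
    ("def add(a, b):\n    return a + b\n", True),
    ('API_KEY = "sk-not-a-real-key"\n', True),
    ("import subprocess\nsubprocess.run(cmd, shell=True)\n", False),
]
summary = run_eval(cases, threshold=0.6)
print(summary)
```

Here the pass rate clears the threshold, but the gate still fails because two outputs trip a safety pattern. Keeping the dimensions separate in the summary makes it clear which one caused a regression.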