Test coverage — what it measures, what it doesn't, and mutation testing

medium

Why this matters

100% test coverage doesn't mean your tests are good. It means every line was executed, but a test that calls calculate_discount(100) and asserts result is not None achieves line coverage without verifying anything meaningful. Coverage is a floor, not a ceiling: very low coverage (a common rule of thumb is below about 60%) is a red flag, but 100% is not the goal. Mutation testing measures test quality far more directly: it introduces deliberate bugs (mutants) into your code, such as changing > to >=, replacing + with -, or deleting a line, and checks whether your tests catch them. The share of mutants that make at least one test fail (the "killed" mutants) is the mutation score. A test suite with 95% coverage that kills 40% of mutants is much weaker than an 80% coverage suite that kills 85% of mutants.
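To make that comparison concrete, here is a minimal sketch of the arithmetic. The mutant counts are hypothetical, chosen only to reproduce the 40% and 85% figures above, and mutation_score is an illustrative helper rather than part of any tool.

```python
def mutation_score(killed: int, total: int) -> float:
    """Fraction of generated mutants that made at least one test fail."""
    if total <= 0:
        raise ValueError("total must be positive")
    return killed / total


# Hypothetical numbers: both suites are run against the same 200 mutants.
TOTAL_MUTANTS = 200
killed_by_suite = {
    "suite A (95% line coverage)": 80,    # 80 of 200 mutants killed
    "suite B (80% line coverage)": 170,   # 170 of 200 mutants killed
}

for name, killed in killed_by_suite.items():
    score = mutation_score(killed, TOTAL_MUTANTS)
    print(f"{name}: mutation score {score:.0%}")
```

Suite B executes fewer lines, yet it notices more than twice as many injected bugs, which is the property you actually care about.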

Demo

pytest-cov tells you which lines were executed during testing, but execution is not the same as verification: a test that calls a function and only asserts result is not None achieves line coverage while testing almost nothing. Mutation testing closes this gap by injecting deliberate bugs into your source — flipping > to >=, removing a negation, deleting a return — and checking whether any test fails as a result. A surviving mutant is a test gap: the code was broken and nobody noticed, which is exactly the kind of undetected defect that reaches production.

# pip install pytest pytest-cov

# The code under test
def calculate_shipping(weight_kg: float, is_express: bool) -> float:
    """Shipping cost: $5 base + $2/kg, express is 2× total. Free at $50 or more."""
    if weight_kg <= 0:
        raise ValueError("Weight must be positive")
    base = 5.0 + (2.0 * weight_kg)
    if is_express:
        base *= 2
    if base >= 50.0:
        return 0.0   # free shipping
    return round(base, 2)

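# ── Weak tests (100% line coverage, but they miss logic errors) ──────────────
def test_standard_shipping_WEAK():
    result = calculate_shipping(5.0, False)
    assert result > 0          # ← passes even if formula is wrong

def test_express_shipping_WEAK():
    result = calculate_shipping(5.0, True)
    assert result != calculate_shipping(5.0, False)  # ← just checks it's different

def test_heavy_parcel_WEAK():
    # Executes the free-shipping return, but never checks the value
    assert calculate_shipping(30.0, False) is not None

def test_invalid_weight_WEAK():
    import pytest
    with pytest.raises(Exception):   # ← executes the raise; any exception passes
        calculate_shipping(-1, False)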

# ── Strong tests (actual values, each branch and boundary) ───────────────────
def test_standard_shipping_exact_value():
    # Arrange: 5kg standard → base = 5 + 2*5 = 15
    assert calculate_shipping(5.0, False) == 15.0

def test_express_doubles_the_total():
    # base = 15.0 → express = 30.0
    assert calculate_shipping(5.0, True) == 30.0

def test_free_shipping_at_threshold():
    # 22.5kg standard: 5 + 45 = 50 → free
    assert calculate_shipping(22.5, False) == 0.0

def test_just_below_free_shipping_threshold():
    # 22.4kg: 5 + 44.8 = 49.8 → NOT free (boundary value)
    assert calculate_shipping(22.4, False) == 49.8

def test_zero_weight_raises():
    import pytest
    with pytest.raises(ValueError):
        calculate_shipping(0, False)

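# Run: pytest main.py --cov=. --cov-report=term-missing
# (name the file explicitly: pytest's default discovery only collects test_*.py and *_test.py)
# Coverage cannot tell the two sets apart: a line counts as covered whether the
# assertion that follows is weak or strong. To see the difference, run mutation testing:
# pip install mutmut
# mutmut run
# mutmut results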

Run: pytest main.py (python3 main.py only defines the functions; it does not execute any test)
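Before reaching for a tool, you can see the idea by writing two mutants by hand. The sketch below is self-contained: it repeats the function from the demo, adds two hypothetical mutants (mutant_threshold changes >= to >, mutant_rate changes the per-kg rate from 2.0 to 3.0), and runs the weak and strong assertions from the demo against each. The helper names are illustrative, not part of pytest or mutmut.

```python
from typing import Callable

ShippingFn = Callable[[float, bool], float]


def original(weight_kg: float, is_express: bool) -> float:
    if weight_kg <= 0:
        raise ValueError("Weight must be positive")
    base = 5.0 + (2.0 * weight_kg)
    if is_express:
        base *= 2
    if base >= 50.0:
        return 0.0
    return round(base, 2)


def mutant_threshold(weight_kg: float, is_express: bool) -> float:
    if weight_kg <= 0:
        raise ValueError("Weight must be positive")
    base = 5.0 + (2.0 * weight_kg)
    if is_express:
        base *= 2
    if base > 50.0:          # mutation: >= became >
        return 0.0
    return round(base, 2)


def mutant_rate(weight_kg: float, is_express: bool) -> float:
    if weight_kg <= 0:
        raise ValueError("Weight must be positive")
    base = 5.0 + (3.0 * weight_kg)   # mutation: 2.0 became 3.0
    if is_express:
        base *= 2
    if base >= 50.0:
        return 0.0
    return round(base, 2)


def weak_suite(fn: ShippingFn) -> None:
    assert fn(5.0, False) > 0
    assert fn(5.0, True) != fn(5.0, False)


def strong_suite(fn: ShippingFn) -> None:
    assert fn(5.0, False) == 15.0
    assert fn(5.0, True) == 30.0
    assert fn(22.5, False) == 0.0
    assert fn(22.4, False) == 49.8


def passes(suite: Callable[[ShippingFn], None], fn: ShippingFn) -> bool:
    """True if every assertion in the suite holds for this implementation."""
    try:
        suite(fn)
    except AssertionError:
        return False
    return True


for mutant in (mutant_threshold, mutant_rate):
    for suite in (weak_suite, strong_suite):
        status = "survived" if passes(suite, mutant) else "killed"
        print(f"{mutant.__name__} vs {suite.__name__}: {status}")
```

Both mutants survive the weak assertions and are killed by the strong ones. A mutation testing tool automates exactly this loop over many more mutants.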

Try it yourself

Run pytest --cov=. --cov-report=term-missing on both the weak and strong test sets. Do both show 100% line coverage? Now change >= to > in the free-shipping threshold check. Which test set catches the bug?

Install mutmut (pip install mutmut) and run mutmut run. Check the results with mutmut results. How many mutants does the weak test set kill vs the strong test set? Run each set on its own to compare. The fraction of mutants killed is the mutation score, a far better indicator of test quality than line coverage.

Look at your own codebase's coverage report. Find a file with 100% line coverage but low branch coverage. What branch conditions are untested? Write one test that exercises a missing branch.

Research the difference between line coverage, branch coverage, and path coverage. For if a and b: — how many paths exist (a=T/b=T, a=T/b=F, a=F), and how many lines? What's the minimum number of tests for 100% branch coverage?
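For the last two exercises, the gap between line and branch coverage can be explored with a small sketch. classify and evaluated_operands are hypothetical helpers written for this illustration. To have pytest-cov report branches as well as lines, add its --cov-branch option to the pytest command.

```python
from itertools import product


def classify(a: bool, b: bool) -> str:
    label = "not both"
    if a and b:
        label = "both"
    return label


# One call executes every line of classify: 100% line coverage.
assert classify(True, True) == "both"

# Branch coverage also needs the case where the `if` is False and its body is skipped.
assert classify(True, False) == "not both"


def evaluated_operands(a: bool, b: bool) -> list:
    """Record which operands of `a and b` Python actually evaluates."""
    seen = []

    def left() -> bool:
        seen.append("a")
        return a

    def right() -> bool:
        seen.append("b")
        return b

    if left() and right():
        pass
    return seen


# Four input combinations, but only three distinct evaluation paths:
# when a is False, `and` short-circuits and b is never looked at.
for a, b in product([True, False], repeat=2):
    print(f"a={a!s:5} b={b!s:5} evaluated={evaluated_operands(a, b)} -> {classify(a, b)}")
```

The first assertion alone gives full line coverage. The second adds the missing branch, so two tests are enough for 100% branch coverage of this function, while exercising all three evaluation paths takes a third test with a=False.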

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why 100% test coverage is not the goal, and what mutation testing reveals that coverage doesn't. Give a concrete example of a test that achieves line coverage without being useful.

2. Why it works (the mechanism)

Walk me through how mutation testing works: what is a mutant, how is it generated, what does it mean to 'kill' a mutant vs have a 'surviving' mutant, and what does a high mutation score tell you about your test suite?

3. Advanced — application & what's next

My team's coverage report shows 82% but I suspect the tests are weak (many assertions are `assert result is not None` or `assert len(result) > 0`). Walk me through: how to identify weak assertions in a code review, how to set up mutation testing in CI as a quality gate, what mutation score threshold is reasonable (60%? 80%? 95%?), and how to prioritize which modules to strengthen first.