Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
100% test coverage doesn't mean your tests are good. It means every line was executed — but a test that calls calculate_discount(100) and asserts result is not None achieves line coverage without testing anything meaningful. Coverage is a floor, not a ceiling: below 60% is a red flag, but 100% is not the goal. Mutation testing is the real measure of test quality: it introduces deliberate bugs (mutations) into your code — changing > to >=, replacing + with -, deleting a line — and checks whether your tests catch them. A test suite with 95% coverage that kills 40% of mutants is much weaker than an 80% coverage suite that kills 85% of mutants.
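A minimal sketch of that weak assertion. The lesson does not show the body of calculate_discount, so the "one tenth of the price" rule, the buggy variant, and the two check helpers below are all hypothetical, made up only to illustrate the point.

```python
def calculate_discount(price: float) -> float:
    """Hypothetical rule (assumed for illustration): discount is one tenth of the price."""
    return price / 10


def calculate_discount_buggy(price: float) -> float:
    """Same function with a deliberate bug: multiplies instead of divides."""
    return price * 10


def weak_check(fn) -> bool:
    # Executes every line of fn, so line coverage is 100%,
    # but accepts any return value that is not None.
    return fn(100) is not None


def exact_check(fn) -> bool:
    # Pins the expected value, so a wrong formula fails.
    return fn(100) == 10.0


for fn in (calculate_discount, calculate_discount_buggy):
    print(fn.__name__, "weak:", weak_check(fn), "exact:", exact_check(fn))
```

The weak check passes for both versions, while the exact check passes only for the correct one, even though both checks execute the same lines.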
pytest-cov tells you which lines were executed during testing, but execution is not the same as verification: a test that calls a function and only asserts result is not None achieves line coverage while testing almost nothing. Mutation testing closes this gap by injecting deliberate bugs into your source — flipping > to >=, removing a negation, deleting a return — and checking whether any test fails as a result. A surviving mutant is a test gap: the code was broken and nobody noticed, which is exactly the kind of undetected defect that reaches production.
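To make the mechanism concrete, here is a rough sketch of a single mutation operator built on Python's standard ast module. It is not how mutmut or any other tool is implemented. The shipping function, the FlipGtE class, and the two suite helpers are names invented for this illustration, and a real tool applies many operators, one mutant at a time.

```python
import ast

# Source of a small function to mutate (illustrative, not a real project file).
SOURCE = '''
def shipping(weight_kg, is_express):
    base = 5.0 + 2.0 * weight_kg
    if is_express:
        base *= 2
    if base >= 50.0:
        return 0.0
    return round(base, 2)
'''


class FlipGtE(ast.NodeTransformer):
    """Mutation operator: rewrite every `>=` comparison as `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the `shipping` function it defines."""
    namespace = {}
    exec(compile(ast.fix_missing_locations(tree), "<shipping>", "exec"), namespace)
    return namespace["shipping"]


def weak_suite(fn) -> bool:
    # Only checks that results are positive and different from each other.
    return fn(5.0, False) > 0 and fn(5.0, True) != fn(5.0, False)


def strong_suite(fn) -> bool:
    # Checks exact values, including both sides of the threshold.
    return (
        fn(5.0, False) == 15.0
        and fn(5.0, True) == 30.0
        and fn(22.5, False) == 0.0   # exactly at the threshold
        and fn(22.4, False) == 49.8  # just below it
    )


original = load(ast.parse(SOURCE))
mutant = load(FlipGtE().visit(ast.parse(SOURCE)))

for name, suite in [("weak", weak_suite), ("strong", strong_suite)]:
    assert suite(original), "a suite must pass on unmutated code"
    verdict = "survived" if suite(mutant) else "killed"
    print(f"{name} suite: mutant {verdict}")
```

With this one mutant, the weak suite still passes (the mutant survives) and the strong suite fails on the at-threshold case (the mutant is killed). The mutation score is the number of killed mutants divided by the total number generated.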
# pip install pytest pytest-cov
import pytest


# The code under test
def calculate_shipping(weight_kg: float, is_express: bool) -> float:
    """Shipping cost: $5 base + $2/kg, express is 2× total. Free at $50 or more."""
    if weight_kg <= 0:
        raise ValueError("Weight must be positive")
    base = 5.0 + (2.0 * weight_kg)
    if is_express:
        base *= 2
    if base >= 50.0:
        return 0.0  # free shipping
    return round(base, 2)


# ── Weak tests (execute the formula lines but miss logic errors) ─────────────
def test_standard_shipping_WEAK():
    result = calculate_shipping(5.0, False)
    assert result > 0  # ← passes even if formula is wrong


def test_express_shipping_WEAK():
    result = calculate_shipping(5.0, True)
    assert result != calculate_shipping(5.0, False)  # ← just checks it's different


# ── Strong tests (actual values, each branch and boundary) ───────────────────
def test_standard_shipping_exact_value():
    # 5kg standard → base = 5 + 2*5 = 15
    assert calculate_shipping(5.0, False) == 15.0


def test_express_doubles_the_total():
    # base = 15.0 → express = 30.0
    assert calculate_shipping(5.0, True) == 30.0


def test_free_shipping_at_threshold():
    # 22.5kg standard: 5 + 45 = 50 → free
    assert calculate_shipping(22.5, False) == 0.0


def test_just_below_free_shipping_threshold():
    # 22.4kg: 5 + 44.8 = 49.8 → NOT free (boundary value)
    assert calculate_shipping(22.4, False) == 49.8


def test_zero_weight_raises():
    with pytest.raises(ValueError):
        calculate_shipping(0, False)


# Run: pytest --cov=. --cov-report=term-missing
# Run together, the tests reach 100% line coverage. The weak tests alone execute
# the formula and express lines yet would still pass if the formula were wrong,
# and they never reach the ValueError or free-shipping lines.
# To find the difference, run mutation testing:
# pip install mutmut && mutmut run && mutmut results

Run pytest --cov=. --cov-report=term-missing on both the weak and strong test sets. Do both show 100% line coverage? Now change >= to > in the free-shipping threshold check. Which test set catches the bug?
Install mutmut (pip install mutmut) and run mutmut run. Check the results with mutmut results. How many mutants does the weak test set kill vs the strong test set? This is the mutation score, the true measure of test quality.
Consider the condition `if a and b:`. How many paths exist (a=T/b=T, a=T/b=F, a=F), and how many lines? What's the minimum number of tests for 100% branch coverage?

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain why 100% test coverage is not the goal, and what mutation testing reveals that coverage doesn't. Give a concrete example of a test that achieves line coverage without being useful.
Walk me through how mutation testing works: what is a mutant, how is it generated, what does it mean to 'kill' a mutant vs have a 'surviving' mutant, and what does a high mutation score tell you about your test suite?
My team's coverage report shows 82% but I suspect the tests are weak (many assertions are `assert result is not None` or `assert len(result) > 0`). Walk me through: how to identify weak assertions in a code review, how to set up mutation testing in CI as a quality gate, what mutation score threshold is reasonable (60%? 80%? 95%?), and how to prioritize which modules to strengthen first.