Capstok — learn by doing

Why this matters

Frequency analysis is the oldest cryptanalytic technique we have — al-Kindi described it in 9th-century Baghdad, and it works against every cipher that preserves the structure of the underlying language. Its existence is the reason every cipher we still use is built to look statistically uniform: AES output should be indistinguishable from random bytes, ChaCha20 keystream should pass every randomness test we can throw at it. Understanding what frequency analysis can recover and what it cannot — it shreds simple substitution but reveals nothing if the ciphertext is genuinely uniform — is how you internalise what 'indistinguishable from random' actually means as a security goal. Every time you see a paper proving a scheme is 'IND-CPA secure,' the underlying intuition is: an attacker armed with frequency analysis (and far more) cannot do better than guessing.

Demo

English letter frequencies are stable across large samples: E ≈ 12.7%, T ≈ 9.1%, A ≈ 8.2%, and Z ≈ 0.07%. A monoalphabetic substitution cipher preserves these frequencies under a permutation, so the most common ciphertext letter is almost certainly E.

\chi^2(C) = \sum_{i=0}^{25} \frac{(\text{obs}_i - \text{exp}_i)^2}{\text{exp}_i}

Try it yourself

Take a paragraph of English text (~500 chars) and compute its letter frequencies — verify that E, T, A, O are in the top 4 with frequencies near 12.7%, 9.1%, 8.2%, 7.5%.
Encrypt that paragraph with caesar_encrypt(pt, 11) and run best_caesar_shift on the ciphertext — confirm it returns 11.
Apply a random substitution to the same paragraph and compute its chi-squared score — verify it is much higher than English (since the letters got shuffled).
Take a Vigenère-encrypted paragraph with key length 5 and split the ciphertext into 5 streams (positions 0,5,10,...; 1,6,11,...; etc.) — compute frequencies on each stream and confirm each stream alone resembles a Caesar shift of English.
Compute the chi-squared score of pure random ASCII letters (use random.choices) and confirm it is much higher than English text — this is what 'indistinguishable from random' looks like to the test.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what frequency analysis is and why letter frequencies in English are stable enough that an attacker can use them as a fingerprint.

2. Why it works (the mechanism)

Walk me through how chi-squared analysis is used to score a candidate decryption against the English frequency distribution, and how that score lets you rank all 26 Caesar shifts to pick the most likely plaintext.

3. Advanced — application & what's next

Given that frequency analysis trivially breaks any substitution cipher, what design property must a modern symmetric cipher like AES have so that frequency analysis on the ciphertext yields no information? Connect this to the formal goal of pseudorandomness.

References

// main.go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

var englishFreq = map[rune]float64{
	'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
	'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
	'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
	'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
	'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
	'Z': 0.074,
}

func chiSquared(text string) float64 {
	upper := strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) {
			return unicode.ToUpper(r)
		}
		return -1
	}, text)

	n := float64(len(upper))
	counts := make(map[rune]int)
	for _, c := range upper {
		counts[c]++
	}

	chi2 := 0.0
	for ch, expectedPct := range englishFreq {
		expected := expectedPct * n / 100
		observed := float64(counts[ch])
		if expected > 0 {
			diff := observed - expected
			chi2 += diff * diff / expected
		}
	}
	return chi2
}

// caesarShift decodes ciphertext by shifting each letter back by k positions
func caesarShift(s string, k int) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) {
			return rune((int(unicode.ToUpper(r))-65-k+26)%26 + 65)
		}
		return r
	}, strings.ToUpper(s))
}

// Find best Caesar shift by chi-squared minimisation
func bestCaesarShift(ciphertext string) int {
	bestShift := 0
	bestScore := -1.0
	for k := 0; k < 26; k++ {
		score := chiSquared(caesarShift(ciphertext, k))
		if bestScore < 0 || score < bestScore {
			bestScore = score
			bestShift = k
		}
	}
	return bestShift
}

func main() {
	ct := "WKLV LV D VHFUHW PHVVDJH"
	fmt.Println("Best shift:", bestCaesarShift(ct)) // 3
}

Run: go run main.go