Capstok — learn by doing

Why this matters

The bot will make thousands of decisions before you know whether it's helping or hurting. Without a log, you'll form your judgment from three or four salient memories (probably the two times it embarrassed you and the one time it saved you an hour), and your policy will drift on vibes.

Reading the meter means designing the log first, before any decision code, so that every action the bot takes writes one structured line: {ts, action_type, rung, stakes, outcome, human_override, tokens_in, tokens_out, latency_ms, cost_usd}. The human_override field is null when nobody intervened. Six weeks in, this log answers questions your intuition can't: what fraction of AUTO actions were later overridden (those should have been NOTIFY); which action types have the worst outcome distribution (those should be paused); and how much you're actually spending per action (whether the ROI is real).

Every module in this course emits at least one log line per action, and Module 10 is about actually reading them.
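As a minimal sketch of the "worst outcome distribution" question, the snippet below tallies outcomes per action type. The sample rows, the file_ticket action type, and the "error" outcome label are invented for illustration; only the action_type and outcome field names come from the log line described above.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; a real run would load these from the log file.
rows = [
    {"action_type": "draft_reply", "outcome": "clean"},
    {"action_type": "draft_reply", "outcome": "clean"},
    {"action_type": "draft_reply", "outcome": "user_override"},
    {"action_type": "file_ticket", "outcome": "error"},
    {"action_type": "file_ticket", "outcome": "clean"},
]

def outcome_distribution(rows):
    """Return {action_type: {outcome: share}}; shares sum to 1 per action type."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r["action_type"]][r["outcome"]] += 1
    return {
        action: {o: n / sum(c.values()) for o, n in c.items()}
        for action, c in counts.items()
    }

dist = outcome_distribution(rows)
for action, shares in sorted(dist.items()):
    print(action, {o: round(s, 2) for o, s in shares.items()})
```

An action type whose non-clean share keeps growing week over week is a candidate for pausing; where to draw that line is a judgment call the log informs but does not make for you.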

Demo

A minimum-viable log line + a query that turns 5,000 log lines into an actionable decision. Structured, append-only, cheap.

Try it yourself

Wire log() into any existing script you have that calls an LLM. Run it 20 times and read the log.
Add a notes field for free-text observations and use it whenever you override an action.
Compute cost per action from tokens_in and tokens_out at your model's per-token prices (input and output tokens are usually priced differently), then check whether the actual cost fits within your budget headroom.
Write the query that tells you which action_type has the highest cost per week and the lowest override rate: that's the one to double down on.
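The last two exercises can be sketched together. This is one possible approach, not a reference solution: PRICE_IN and PRICE_OUT are placeholder numbers rather than any real model's rates, the summarize_thread action type and the sample rows are hypothetical, and grouping by ISO week is just one reasonable definition of "per week".

```python
from collections import defaultdict
from datetime import datetime, timezone

# ASSUMPTION: placeholder prices in USD per token, not any real model's rates.
PRICE_IN = 3.00 / 1_000_000
PRICE_OUT = 15.00 / 1_000_000

def cost_usd(tokens_in: int, tokens_out: int) -> float:
    """Cost of one action at the placeholder prices above."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

def weekly_summary(rows):
    """Aggregate cost and override rate per (ISO year, ISO week, action_type)."""
    agg = defaultdict(lambda: {"n": 0, "cost": 0.0, "overridden": 0})
    for r in rows:
        iso = datetime.fromtimestamp(r["ts"], tz=timezone.utc).isocalendar()
        bucket = agg[(iso[0], iso[1], r["action_type"])]
        bucket["n"] += 1
        bucket["cost"] += cost_usd(r["tokens_in"], r["tokens_out"])
        bucket["overridden"] += r["human_override"] is not None
    return {
        key: {"cost": b["cost"], "override_rate": b["overridden"] / b["n"], "n": b["n"]}
        for key, b in agg.items()
    }

# Hypothetical sample rows, all falling in the same week.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
rows = (
    [{"ts": t0 + i, "action_type": "draft_reply", "tokens_in": 1200, "tokens_out": 250,
      "human_override": "edited" if i == 0 else None} for i in range(3)]
    + [{"ts": t0 + i, "action_type": "summarize_thread", "tokens_in": 4000, "tokens_out": 500,
        "human_override": None} for i in range(2)]
)

summary = weekly_summary(rows)
# Highest weekly cost first; ties broken by lowest override rate.
ranked = sorted(summary.items(), key=lambda kv: (-kv[1]["cost"], kv[1]["override_rate"]))
for (year, week, action), s in ranked:
    print(f"{year}-W{week:02d} {action:18s} cost=${s['cost']:.4f} overrides={s['override_rate']:.0%}")
```

Sorting by cost and then by override rate is only one way to combine the two criteria. With more data you may prefer to filter to action types below some override rate first and then rank the survivors by cost.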

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why a structured action log matters more than the bot's decision code.

2. Why it works (the mechanism)

Walk me through how a 5,000-row action log answers questions that my gut memory can't — availability bias, salience, and log-scale outcome distributions.

3. Advanced — application & what's next

My bot has been running for 4 weeks. Design me 3 specific SQL queries against the log that would tell me: (1) which actions to promote to AUTO, (2) which to demote to ASK, (3) what my per-week bot budget should be.

import json, os, time
from dataclasses import dataclass, asdict
from typing import Optional

LOG_PATH = "bot_actions.jsonl"

@dataclass
class ActionLog:
    ts: float
    action_type: str
    rung: str
    stakes: str
    outcome: str
    human_override: Optional[str]  # "approved" | "edited" | "canceled" | None
    tokens_in: int
    tokens_out: int
    latency_ms: int
    cost_usd: float

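def log(entry: ActionLog) -> None:
    """Append one action as a single JSON line (append-only, never rewritten)."""
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")

def read_log() -> list:
    """Load every logged action; returns [] if the log doesn't exist yet."""
    if not os.path.exists(LOG_PATH):
        return []
    with open(LOG_PATH, encoding="utf-8") as f:
        # Skip blank lines so a stray newline doesn't crash json.loads
        return [json.loads(line) for line in f if line.strip()]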

# Simulate a few actions (the log is append-only, so each run adds 10 more lines)
for i in range(10):
    log(ActionLog(
        ts=time.time() + i,
        action_type="draft_reply",
        rung="NOTIFY",
        stakes="medium",
        outcome="clean" if i != 3 else "user_override",
        human_override="edited" if i == 3 else None,
        tokens_in=1200, tokens_out=250,
        latency_ms=1800,
        cost_usd=0.0042,
    ))

# The one query that matters: what's the human-override rate per action type and rung?
from collections import defaultdict

rows = read_log()
# (action_type, rung) -> counts of total and overridden actions
overrides = defaultdict(lambda: {"total": 0, "overridden": 0})
for r in rows:
    key = (r["action_type"], r["rung"])
    overrides[key]["total"] += 1
    if r["human_override"] is not None:
        overrides[key]["overridden"] += 1

for (a, rung), s in sorted(overrides.items()):
    pct = 100 * s["overridden"] / s["total"] if s["total"] else 0
    print(f"{a:20s} rung={rung:6s} overrides={pct:4.1f}% ({s['overridden']}/{s['total']})")

# Rule of thumb: if AUTO gets >5% overridden, downgrade to NOTIFY.

Run (with the code above saved as main.py): python3 main.py
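The 5% rule of thumb can be applied mechanically to the per-rung counts. The sketch below assumes stats shaped like the overrides dictionary in the demo; the min_samples guard is an addition of this example (not part of the rule as stated) to avoid reacting to a handful of actions, and the sample numbers and action types are invented.

```python
def autos_to_downgrade(stats, threshold=0.05, min_samples=20):
    """Return [(action_type, override_rate)] for AUTO actions overridden too often.

    stats: {(action_type, rung): {"total": int, "overridden": int}}
    min_samples is a hypothetical guard so tiny samples are not flagged.
    """
    flagged = []
    for (action, rung), s in sorted(stats.items()):
        if rung != "AUTO" or s["total"] < min_samples:
            continue
        rate = s["overridden"] / s["total"]
        if rate > threshold:
            flagged.append((action, rate))
    return flagged

# Invented counts for illustration only.
stats = {
    ("draft_reply", "AUTO"): {"total": 100, "overridden": 8},     # 8%: flagged
    ("archive_email", "AUTO"): {"total": 200, "overridden": 4},   # 2%: fine
    ("send_invoice", "AUTO"): {"total": 5, "overridden": 2},      # too few samples
    ("draft_reply", "NOTIFY"): {"total": 50, "overridden": 10},   # not AUTO
}

flagged = autos_to_downgrade(stats)
for action, rate in flagged:
    print(f"downgrade {action} from AUTO to NOTIFY (override rate {rate:.0%})")
```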

Reading the meter: what to log so future-you knows what worked