Tokenization at serve time

medium

Why this matters

Before the model sees text, a tokenizer converts it to token ids; afterwards, it converts the generated ids back to text. At serve time this boundary matters more than people expect, for three reasons: tokenization has its own cost, partial tokens must be handled correctly during streaming, and token counts (not characters) drive billing and context limits. A serving engineer who understands tokenization avoids a whole class of bugs: broken multi-byte characters mid-stream, off-by-token context overflows, and surprise costs from token-dense inputs. It's the unglamorous boundary layer that wraps every request and response.
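To make the token-versus-character point concrete, here is a minimal sketch of a context-budget check. It assumes a stand-in counter that charges one token per UTF-8 byte; a real deployment would call the model's own tokenizer instead, and `count_tokens`, `fits_context`, and the window size are hypothetical names and values chosen for illustration.

```python
def count_tokens(text):
    """Stand-in token counter (assumption): one token per UTF-8 byte."""
    return len(text.encode("utf-8"))

def fits_context(prompt, max_new_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return count_tokens(prompt) + max_new_tokens <= context_window

prompt = "日本語のテキスト"  # 8 characters, 24 UTF-8 bytes

# A character-based check would accept this request (8 + 10 <= 32),
# but the token-based check rejects it (24 + 10 > 32).
char_based_ok = len(prompt) + 10 <= 32
token_based_ok = fits_context(prompt, max_new_tokens=10, context_window=32)
print(char_based_ok, token_based_ok)
```

The gap between the two checks is exactly the kind of off-by-token overflow that only shows up on token-dense inputs.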

Demo

The demo shows tokenize/detokenize round-tripping and the streaming gotcha: a single character (like an emoji) can span multiple tokens, so naively decoding token-by-token during streaming produces broken output unless you buffer.
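The streaming gotcha can be reproduced without any model. The sketch below is not the lesson's demo; it assumes a hypothetical byte-level tokenizer (`toy_encode`, one token id per UTF-8 byte) so that an emoji visibly spans several tokens, then compares naive per-token decoding with a buffered decoder built on Python's standard `codecs` incremental decoder.

```python
import codecs

def toy_encode(text):
    """Hypothetical byte-level tokenizer: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))

def naive_stream_decode(token_ids):
    """Decode every token in isolation and concatenate the pieces."""
    return "".join(
        bytes([tid]).decode("utf-8", errors="replace") for tid in token_ids
    )

def buffered_stream_decode(token_ids):
    """Yield text only once the bytes seen so far form whole characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for tid in token_ids:
        piece = decoder.decode(bytes([tid]))  # "" while a character is incomplete
        if piece:
            yield piece
    tail = decoder.decode(b"", final=True)  # flush whatever is left at end of stream
    if tail:
        yield tail

text = "hi 🙂"
ids = toy_encode(text)
naive = naive_stream_decode(ids)
buffered = "".join(buffered_stream_decode(ids))

print(len(text), len(ids))  # 4 characters, 7 tokens
print(repr(naive))          # the emoji becomes four U+FFFD replacement characters
print(buffered == text)     # True
```

Real subword tokenizers split text differently, but the fix has the same shape: hold back output until the accumulated bytes decode to complete characters.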

Try it yourself

Tokenize an emoji or a CJK string and confirm a single character spans multiple tokens.
Decode those tokens one-at-a-time and concatenate; watch the output corrupt — then fix it with buffering.
Measure tokens for code vs. prose vs. JSON (recall Course E) and connect to billing/limits.
Reason about why the tokenizer must match the model exactly, and what breaks if it doesn't.
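For the last item, a toy illustration may help. The sketch below assumes two tiny hypothetical vocabularies that contain the same pieces under different ids; encoding with one and decoding with the other produces silently wrong text, which is the same failure a model sees when its input ids come from a tokenizer it was not trained with.

```python
# Two hypothetical vocabularies: same pieces, different id assignment.
VOCAB_A = ["hello", " world", "!", " token"]
VOCAB_B = ["hello", "!", " world", " token"]

def encode(text, vocab):
    """Greedy longest-match encoding over a tiny vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        match = max(
            (piece for piece in vocab if text.startswith(piece, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no vocabulary piece matches at offset {i}")
        ids.append(vocab.index(match))
        i += len(match)
    return ids

def decode(ids, vocab):
    """Map ids back to their pieces and concatenate."""
    return "".join(vocab[i] for i in ids)

ids = encode("hello world!", VOCAB_A)
print(ids)                   # [0, 1, 2]
print(decode(ids, VOCAB_A))  # hello world!
print(decode(ids, VOCAB_B))  # hello! world
```

Nothing raises an error in the mismatched case, which is why this class of bug tends to surface as degraded output quality rather than a crash.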

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

What does a tokenizer do at serve time, and why does it matter for streaming and billing?

2. Why it works (the mechanism)

Explain why a single visible character can span multiple tokens and how that breaks naive token-by-token stream decoding, plus the fix.

3. Advanced — application & what's next

Walk me through the serving-time tokenization concerns: incremental decoding for streaming, token-vs-character accounting for limits/billing, and why the tokenizer must exactly match the model.

References

When the model call fails. Read the error and decide: fix the request, retry, or fall back.
Fatal (fix the request, don't retry): 400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), 404 (wrong model id).
Transient (retry with backoff): 429, 500/502/503, Anthropic 529 (overloaded), and timeouts.
Non-HTTP failures to watch for: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects.
Always log the provider with the error so you can trace it later.
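The fatal/transient split above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not any provider's SDK: `send` is a hypothetical callable returning a `(status, body)` pair, a status of `None` stands for a timeout, and unlisted statuses are treated as fatal by choice.

```python
import logging
import random
import time

# Statuses the text above lists as transient; everything else that is not 2xx
# (400, 401, 403, 404, 422, and any unlisted status) is treated as fatal.
TRANSIENT = {429, 500, 502, 503, 529}

def classify(status):
    """Map an HTTP status (None = timeout) to 'ok', 'retry', or 'fatal'."""
    if status is None or status in TRANSIENT:
        return "retry"
    if 200 <= status < 300:
        return "ok"
    return "fatal"

def call_with_retry(send, provider, max_attempts=4, base=0.5, cap=8.0,
                    sleep=time.sleep):
    """Call send() until it succeeds, fails fatally, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        verdict = classify(status)
        if verdict == "ok":
            return body
        # Log the provider with every error so failures can be traced later.
        logging.warning("provider=%s status=%s attempt=%d",
                        provider, status, attempt)
        if verdict == "fatal" or attempt == max_attempts:
            raise RuntimeError(f"{provider} call failed with status {status}")
        # Exponential backoff with full jitter, capped at `cap` seconds.
        sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with a fake transport: overloaded, then a timeout, then success.
responses = iter([(529, None), (None, None), (200, "ok")])
result = call_with_retry(lambda: next(responses), "demo-provider",
                         sleep=lambda seconds: None)
print(result)
```

Injecting `sleep` keeps the wrapper testable without real delays; the non-HTTP failures (truncation, refusals, malformed JSON) need their own checks on the response body and are not covered here.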