Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Before the model sees text, a tokenizer converts it to token ids; after generation, it converts the output ids back to text. At serve time this matters more than people expect, for three reasons: tokenization cost, how partial tokens are handled during streaming, and how token counts (not characters) drive billing and context limits. A serving engineer who understands tokenization avoids a whole class of bugs: broken multi-byte characters mid-stream, off-by-token context overflows, and surprise costs from token-dense inputs. It's the unglamorous boundary layer that wraps every request and response.
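To make the token-vs-character point concrete, here is a minimal sketch. It assumes a toy byte-level tokenizer (one token per UTF-8 byte), which is not how any particular production tokenizer works; real subword tokenizers usually merge bytes into fewer tokens. The names `toy_tokenize` and `fits_context` are hypothetical helpers invented for this illustration.

```python
# Toy assumption: one token per UTF-8 byte. Real tokenizers merge bytes into
# larger subword units, but the character/token mismatch is the same idea.
def toy_tokenize(text):
    return list(text.encode("utf-8"))


def fits_context(prompt_tokens, max_new_tokens, context_window):
    """True if the prompt plus the requested output fits in the window.

    Limits are counted in tokens, so both sides of the budget must be
    measured with the tokenizer, never with len(text).
    """
    return prompt_tokens + max_new_tokens <= context_window


samples = ["hello", "h\u00e9llo", "\u4f60\u597d", "\U0001F44D"]
counts = {s: (len(s), len(toy_tokenize(s))) for s in samples}

for s, (chars, toks) in counts.items():
    print(f"{s!r}: {chars} chars, {toks} tokens")
```

Under this toy scheme the emoji is one character but four tokens, which is why a character-based estimate can undercount the budget. In a real service you would measure with the tokenizer that ships with the model you are serving.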
The demo shows tokenize/detokenize round-tripping and the streaming gotcha: a single character (like an emoji) can span multiple tokens, so naively decoding token-by-token during streaming produces broken output unless you buffer.
Use these three in order. Each builds on the one before.
What does a tokenizer do at serve time, and why does it matter for streaming and billing?
Explain why a single visible character can span multiple tokens and how that breaks naive token-by-token stream decoding, plus the fix.
Walk me through the serving-time tokenization concerns: incremental decoding for streaming, token-vs-character accounting for limits/billing, and why the tokenizer must exactly match the model.
When the model call fails, read the error and decide: fix the request, retry, or fall back.
400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), and 404 (wrong model id) are fatal: fix the request and don't retry. 429, 500/502/503, Anthropic 529 (overloaded), and timeouts are transient: retry with backoff. Watch for non-HTTP failures too: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects. Always log the provider with the error so you can trace it later.
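The fatal-vs-transient split above can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not any provider's SDK: `send` is a hypothetical callable that returns a `(status, body)` pair (with `status` set to `None` on timeout), and `ModelCallError`, `classify`, `backoff_delay`, and `call_with_retry` are names invented for this example. The backoff uses exponential growth with random jitter, one common choice among several.

```python
import random
import time
from typing import Optional

FATAL = {400, 401, 403, 404, 422}      # fix the request; retrying won't help
TRANSIENT = {429, 500, 502, 503, 529}  # retry with backoff


class ModelCallError(Exception):
    """Carries the provider name so the failure can be traced later."""

    def __init__(self, provider, status):
        super().__init__(f"{provider}: status={status}")
        self.provider = provider
        self.status = status


def classify(status: Optional[int]) -> str:
    """Map a status code to an action; None stands for a timeout."""
    if status is None or status in TRANSIENT:
        return "retry"
    if status in FATAL:
        return "fatal"
    return "unknown"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with jitter; attempt counts from 0."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(send, provider, max_attempts=4, sleep=time.sleep):
    """Call send() until it succeeds, fails fatally, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        out_of_attempts = attempt == max_attempts - 1
        if classify(status) != "retry" or out_of_attempts:
            raise ModelCallError(provider, status)
        sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so tests can skip real waiting. Non-HTTP failures such as truncation or malformed JSON arrive with a success status, so they need their own checks on the response body.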
# Streaming gotcha: one visible character can span multiple tokens, so decode incrementally with care.
# (Conceptual: real engines handle this; you must too if you post-process streams.)
def stream_decode(tokenizer, token_stream):
    buffer_ids, out = [], ""
    for tid in token_stream:
        buffer_ids.append(tid)
        text = tokenizer.decode(buffer_ids)
        # A trailing \ufffd (U+FFFD) means the buffer ends in an incomplete multi-byte character.
        if not text.endswith("\ufffd"):
            out += text       # flush only complete characters
            buffer_ids = []
    if buffer_ids:
        out += tokenizer.decode(buffer_ids)  # flush any leftover ids when the stream ends
    return out
# If you decode each token independently and concatenate, multi-byte chars (emoji, CJK) break.

python3 main.py
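To see the gotcha without downloading a real tokenizer, the following self-contained sketch uses a toy byte-level tokenizer (each UTF-8 byte is one token id). `ByteTokenizer`, `naive_decode`, and `buffered_decode` are hypothetical names for this illustration; the buffering logic mirrors the conceptual function above rather than any engine's actual implementation.

```python
class ByteTokenizer:
    """Toy tokenizer for illustration: one token id per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        # Incomplete or invalid byte sequences become U+FFFD instead of raising.
        return bytes(ids).decode("utf-8", errors="replace")


def naive_decode(tokenizer, token_stream):
    """Decode each token on its own and concatenate (the buggy approach)."""
    return "".join(tokenizer.decode([tid]) for tid in token_stream)


def buffered_decode(tokenizer, token_stream):
    """Hold tokens back until they decode to complete characters."""
    buffer_ids, out = [], ""
    for tid in token_stream:
        buffer_ids.append(tid)
        text = tokenizer.decode(buffer_ids)
        if not text.endswith("\ufffd"):
            out += text
            buffer_ids = []
    if buffer_ids:
        # Stream ended mid-character: emit whatever the tokenizer can produce.
        out += tokenizer.decode(buffer_ids)
    return out


tok = ByteTokenizer()
text = "ok \U0001F44D \u4f60\u597d"  # ASCII, an emoji, and two CJK characters
ids = tok.encode(text)

naive = naive_decode(tok, ids)
buffered = buffered_decode(tok, ids)

print("naive round-trips:   ", naive == text)
print("buffered round-trips:", buffered == text)
```

One limitation of this heuristic is worth knowing: if the model legitimately generates U+FFFD, the buffer holds that text until later tokens decode cleanly or the stream ends.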