The three taxes: cost, latency, and trust

medium

Why this matters

Adding AI to a product is never free: every AI feature levies three taxes that a plain feature doesn't. There's a money tax (you pay per token, on every call, for as long as the feature runs), a latency tax (model calls are slow, often taking seconds, which breaks the snappy interactions users expect), and a trust tax (the feature can be confidently wrong, and one bad answer can undermine the credibility of the whole feature). Price all three into the decision up front: a feature that's 'cool' but adds a 4-second wait, costs a dollar per use, and is wrong 10% of the time is a net negative. Naming these taxes turns vague unease into a checklist.

Try it yourself

Compute the monthly per-user cost of your top feature at your real call volume.

Put a number on p95 latency and decide if it breaks the interaction's flow.

Estimate the feature's error rate and what one wrong answer costs in user trust.

Find the break-even: how much value per use must the feature deliver to be net-positive?
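The four exercises above can be sketched as a few lines of arithmetic. This is a minimal sketch, not a pricing tool: the function names, token counts, prices, and latency samples are all hypothetical placeholders, and the break-even formula assumes a simple model in which a correct answer delivers some value, a wrong answer costs a fixed amount of trust expressed in dollars, and every call costs the same.

```python
def monthly_cost_per_user(calls_per_month, input_tokens, output_tokens,
                          usd_per_million_input, usd_per_million_output):
    """Token spend for one user over a month, in USD."""
    cost_per_call = (input_tokens * usd_per_million_input
                     + output_tokens * usd_per_million_output) / 1_000_000
    return calls_per_month * cost_per_call


def p95(latencies_s):
    """95th-percentile latency (nearest-rank method) from measured samples."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def break_even_value_per_use(cost_per_use, error_rate, cost_of_wrong_answer):
    """Minimum value a correct answer must deliver for the feature to pay off.

    Assumed model: (1 - e) * value - e * harm - cost >= 0, solved for value.
    """
    return (cost_per_use + error_rate * cost_of_wrong_answer) / (1 - error_rate)


if __name__ == "__main__":
    # Illustrative inputs only; substitute your own volumes and prices.
    cost = monthly_cost_per_user(200, 1500, 300, 3.0, 15.0)
    print(f"monthly cost per user: ${cost:.2f}")

    samples = [0.8, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.9, 2.4, 4.0]
    print(f"p95 latency: {p95(samples)} s")

    value = break_even_value_per_use(1.0, 0.10, 5.0)
    print(f"break-even value per use: ${value:.2f}")
```

The hardest input to pin down is `cost_of_wrong_answer`. Even a rough guess is useful, because it shows how quickly the break-even value climbs as the error rate rises.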

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the three hidden costs (cost, latency, trust) of adding AI to a product.

2. Why it works (the mechanism)

Walk me through how each of cost, latency, and trust actually accrues in a production AI feature.

3. Advanced — application & what's next

Given a feature with great value but high latency and a 10% error rate, how would you decide whether to ship, redesign, or kill it?

References

Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it:

Read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter, plus a max-retry cap) instead of hammering the endpoint.

Cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling.

Cache identical requests so retries don't re-spend quota.

Downshift to a smaller, cheaper model for practice runs.

Use the provider's Batch API for non-interactive jobs.

Or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
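The backoff advice above could look roughly like the following. It is a minimal sketch under stated assumptions: `RateLimited` is a hypothetical stand-in for whatever exception your client library raises on a 429, `retry_after` is assumed to hold the parsed Retry-After value in seconds, and `call_with_backoff` is an illustrative helper, not part of any provider SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 Too Many Requests error."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds from Retry-After, if the server sent one


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimited with capped exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # retry cap reached: give up instead of hammering
            if err.retry_after is not None:
                delay = err.retry_after  # the server told us how long to wait
            else:
                # "Full jitter": a random wait up to the capped exponential delay.
                delay = rng() * min(max_delay, base_delay * 2 ** attempt)
            sleep(delay)
```

The `sleep` and `rng` parameters exist only so the helper can be exercised without real waiting; in normal use you would pass just the function to call, for example `call_with_backoff(lambda: client_call(prompt))`, where `client_call` is your own wrapper that raises `RateLimited` on a 429.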