Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
The decision to run your own serving stack is ultimately economic, and the cost is more than GPU rental. It includes the engineering time to build and maintain the stack, the on-call burden, the idle capacity you pay for to meet peak SLAs, and the opportunity cost of not shipping product features. A hosted API has a clear per-token price; a self-hosted stack trades that for fixed infrastructure plus human cost that only pays off at scale. Being able to model total cost of ownership honestly — including the parts that don't show up on the GPU invoice — is what makes the build-vs-buy decision defensible to leadership and keeps you from self-hosting a workload that a hosted API would serve more cheaply.
The demo builds a simple TCO model comparing a hosted API's per-token cost against a self-hosted stack's GPU plus amortized engineering cost, revealing the break-even volume.
Use these three in order. Each builds on the one before.
What goes into the total cost of owning a self-hosted serving stack beyond GPU rental?
Walk me through a TCO model comparing a hosted API's per-token price to a self-hosted stack's GPU plus amortized engineering and on-call cost.
Given my token volume, GPU costs, team size, and utilization, compute the break-even point where self-hosting beats a hosted API, and name the hidden costs that most often flip the decision.
def tco_compare(monthly_tokens_millions, hosted_price_per_m, gpu_monthly, gpus,
eng_monthly, utilization=0.5):
hosted = monthly_tokens_millions * hosted_price_per_m
# self-hosted: GPUs (paid whether busy or idle) + amortized engineering/on-call
self_hosted = gpus * gpu_monthly + eng_monthly
return {"hosted_usd": round(hosted), "self_hosted_usd": round(self_hosted),
"cheaper": "hosted" if hosted < self_hosted else "self-hosted"}
# Low volume: hosted wins. High volume: self-hosted amortizes the fixed cost.
print("low:", tco_compare(50, hosted_price_per_m=0.5, gpu_monthly=1500, gpus=2, eng_monthly=8000))
print("high:", tco_compare(50000, hosted_price_per_m=0.5, gpu_monthly=1500, gpus=8, eng_monthly=8000))python3 main.py