Production LLMs: what actually matters before you scale
Latency, evals, and governance beat model hype. Here’s the short list I use before recommending scale.
Everyone wants the newest model name in the slide deck. In production, the winning teams obsess over things that sound boring: latency distributions, failure modes, and who gets paged when something drifts.
Start from the promise, not the API
Write down the user-visible promise in one sentence. Then list the ways the system can lie politely — vague answers, outdated facts, tool misuse. If you can’t test for those, you don’t have a product yet; you have a demo.
Observability is part of the feature
Traces should flow from prompt → retrieval → model → tools. Dashboards should answer: Are we slower this week? Are costs spiking for one tenant? Is quality degrading on a specific intent class? Without that, you’re flying blind at the worst possible time — when usage grows.
- Golden sets + live sampling for regression
- Latency SLOs with tail awareness (p95/p99)
- Cost visibility per tenant or use case
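Tail awareness matters because averages hide exactly the requests your users remember. A dependency-free sketch of a nearest-rank percentile over request durations (the numbers are made up for illustration):

```python
# Tail latency from raw request durations: an SLO check should look at
# p95/p99, not the mean. Sample values below are illustrative only.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: small, dependency-free, fine for dashboards."""
    ranked = sorted(samples)
    k = max(0, round(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [120, 135, 140, 150, 155, 160, 180, 210, 450, 1200]

p50 = percentile(latencies_ms, 50)   # 155 ms: the typical request looks fine
p99 = percentile(latencies_ms, 99)   # 1200 ms: the tail tells the real story
mean = sum(latencies_ms) / len(latencies_ms)  # 290 ms hides the 1.2 s outlier
```

The mean here (290 ms) looks healthy while one in ten requests takes over a second; that's the gap a tail-aware SLO is designed to catch.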
Scale the system when the signals say you’re learning faster than you’re guessing. Until then, optimize for learning and safety — not vanity throughput.