LLM · Production · MLOps

Production LLMs: what actually matters before you scale

6 min read

Latency, evals, and governance beat model hype. Here’s the short list I use before recommending scale.

Everyone wants the newest model name in the slide deck. In production, the winning teams obsess over things that sound boring: latency distributions, failure modes, and who gets paged when something drifts.

Start from the promise, not the API

Write down the user-visible promise in one sentence. Then list the ways the system can lie politely — vague answers, outdated facts, tool misuse. If you can’t test for those, you don’t have a product yet; you have a demo.
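One way to make those "polite lies" testable is to turn each one into a cheap check you can run over a golden set. A minimal sketch in Python — the marker list, function names, and fields are illustrative assumptions, not any particular eval framework:

```python
# Sketch: turn the promise's failure modes into checks.
# VAGUE_MARKERS and check_answer are illustrative, not from a framework.
VAGUE_MARKERS = ("it depends", "i cannot say", "as an ai")

def check_answer(answer: str, golden_facts: list[str]) -> dict:
    """Flag two of the 'polite lies': vagueness and missing golden facts."""
    lowered = answer.lower()
    return {
        "vague": any(m in lowered for m in VAGUE_MARKERS),
        "missing_facts": [f for f in golden_facts if f.lower() not in lowered],
    }

report = check_answer(
    "Our refund policy changed in 2024: refunds within 30 days.",
    golden_facts=["30 days", "2024"],
)
# report == {"vague": False, "missing_facts": []}
```

String matching is deliberately crude; the point is that every failure mode on your list gets *some* automated check before launch, even a blunt one.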

Observability is part of the feature

Traces should flow from prompt → retrieval → model → tools. Dashboards should answer: Are we slower this week? Are costs spiking for one tenant? Is quality degrading on a specific intent class? Without that, you’re flying blind at the worst possible time — when usage grows.
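The per-request trace described above can be sketched as a small record type, one span per stage. Everything here (field names, the `record` helper) is an illustrative assumption; in practice you would emit these as OpenTelemetry spans or your tracing backend's equivalent:

```python
# Sketch: one trace per request, one span per stage
# (prompt -> retrieval -> model -> tools). Names are illustrative.
from dataclasses import dataclass, field
import time

@dataclass
class Span:
    stage: str              # "prompt", "retrieval", "model", or "tool"
    started: float
    ended: float = 0.0
    attrs: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.ended - self.started) * 1000

@dataclass
class Trace:
    request_id: str
    tenant: str             # keeps per-tenant cost/latency questions answerable
    spans: list = field(default_factory=list)

    def record(self, stage: str, fn, **attrs):
        """Run one stage and keep its timing attached to the request."""
        span = Span(stage=stage, started=time.monotonic(), attrs=attrs)
        try:
            return fn()
        finally:
            span.ended = time.monotonic()
            self.spans.append(span)

trace = Trace(request_id="r-123", tenant="acme")
docs = trace.record("retrieval", lambda: ["doc-1"], index="kb-v2")
```

Carrying the tenant on every trace is what lets the dashboard answer "are costs spiking for one tenant?" without a join later.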

  • Golden sets + live sampling for regression
  • Latency SLOs with tail awareness (p95/p99)
  • Cost visibility per tenant or use case
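The tail-aware SLO and per-tenant cost items above can be computed from plain trace records. A minimal sketch, assuming each trace carries `latency_ms`, `tenant`, and `cost_usd` fields (those names and the nearest-rank percentile are my choices, not a standard):

```python
# Sketch: tail-aware SLO + per-tenant cost from trace records.
# Field names and the percentile method are assumptions, not a spec.
from collections import defaultdict

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; fine for a dashboard, not for statistics."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def slo_report(traces: list[dict], p99_budget_ms: float) -> dict:
    latencies = [t["latency_ms"] for t in traces]
    cost_by_tenant: dict[str, float] = defaultdict(float)
    for t in traces:
        cost_by_tenant[t["tenant"]] += t["cost_usd"]
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "p99_ok": percentile(latencies, 99) <= p99_budget_ms,
        "cost_by_tenant": dict(cost_by_tenant),
    }
```

Reporting p95/p99 rather than the mean is the whole point of "tail awareness": a healthy average hides the slow requests your angriest users actually see.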

Scale the system when the signals say you’re learning faster than you’re guessing. Until then, optimize for learning and safety — not vanity throughput.

Want to go deeper on this topic for your team or product?

Get in touch