Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiejcr7xeg63agihffithzu57yib2bvxbq77wckxtm5gcepg54tsaq",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp45ponqad42"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreicv7gxryylimypzg2g4pluhej4dorl3kf6ux4jiepczvciqhtcu3e"
    },
    "mimeType": "image/webp",
    "size": 75328
  },
  "path": "/muhammadzainnaseer/how-to-put-an-llm-in-your-product-without-wrecking-your-costs-or-your-latency-89a",
  "publishedAt": "2026-06-25T09:21:40.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "machinelearning",
    "webdev",
    "programming"
  ],
  "textContent": "Adding an AI feature looks deceptively easy. You sign up for an API key, paste in a prompt, and within an hour you've got a working demo that makes the whole team lean over your shoulder. Then you ship it, traffic arrives, and two things happen at once: your latency graph develops a long, ugly tail, and your monthly bill arrives with a number that makes finance schedule a meeting.\n\nThe gap between \"impressive demo\" and \"production feature\" is almost entirely about cost and latency engineering. The model is the easy part. Here's how to cross that gap.\n\n## First, understand what you're actually paying for\n\nMost LLM APIs bill by **tokens** (roughly ¾ of a word each), and they bill _both_ directions: the tokens you send (input) and the tokens the model generates (output). Output tokens are usually several times more expensive than input tokens, which has a non-obvious consequence: token for token, a verbose prompt is cheaper than a verbose answer.\n\nThis reframes optimization. People obsess over trimming their prompts while letting the model ramble for 800 tokens when 80 would do. If you want to cut cost, the highest-leverage move is almost always **constraining the output**: ask for JSON, ask for a single sentence, set a `max_tokens` ceiling, and tell the model explicitly to be terse.\n\nLatency follows the same logic. Generation is sequential (the model produces one token at a time), so output length is the single biggest driver of how long a request takes. A 50-token answer is fast almost regardless of model. A 2,000-token answer is slow even on the fastest infrastructure.\n\n## Lever 1: Don't call the model when you don't have to\n\nThe cheapest, fastest LLM call is the one you never make. Two techniques eliminate a startling share of traffic.\n\n**Caching identical and near-identical requests.** Many real-world prompts repeat: the same FAQ-style question, the same document summarized twice, the same classification of similar inputs. 
A cache keyed on the normalized prompt turns a repeat request into a fast lookup (sub-millisecond for an in-process cache). For exact repeats, a simple key-value cache works. For _similar_ requests, a semantic cache, where you embed the query and return a cached answer if a previous query is close enough in vector space, can absorb far more traffic, at the cost of some tuning.\n\n**Routing to the right tier.** You do not need your most capable model for every task. Classifying a support ticket into one of five buckets is a job for a small, cheap, fast model. Drafting a nuanced customer email is worth the premium one. A simple router (even a keyword or length heuristic before anything fancy) that sends easy work to a cheap model and hard work to an expensive one can cut spend dramatically without anyone noticing a quality drop.\n\n## Lever 2: Make latency feel lower than it is\n\nSometimes you genuinely need a long, high-quality response, and it's genuinely going to take a few seconds. You can't always make it faster, but you can make it _feel_ fast, which is often what actually matters to the user.\n\n**Stream the response.** Instead of waiting for the full answer and dumping it at once, stream tokens as they're generated. The user starts reading after a few hundred milliseconds, and the perceived wait collapses even though total generation time is unchanged. This is the single highest-impact UX change for any chat-style feature, and most SDKs support it with a one-line change.\n\n**Show honest progress for non-streamed work.** If you're doing something multi-step (retrieve, then reason, then format), tell the user what's happening (\"Searching your documents…\", \"Drafting an answer…\"). A visible, truthful status beats a spinner that gives no information about whether anything is working.\n\n## Lever 3: Control the worst case, not just the average\n\nYour average latency is a comforting lie. 
LLM endpoints have heavy tails: most requests are fine, but a meaningful slice takes 3–5× longer, and a few time out entirely. If your product blocks on those, a small fraction of slow requests can dominate the experience.\n\nDefend against the tail explicitly:\n\n  * **Set aggressive timeouts** and decide in advance what happens when you hit one (a cached fallback, a smaller model, a graceful \"try again\") rather than letting the request hang.\n  * **Add a retry with backoff** for transient failures, but cap it. Infinite retries against an overloaded provider just make the outage worse.\n  * **Add a circuit breaker** for sustained failures. If the provider is clearly down, fail fast to your fallback instead of sending every user into a 30-second wait.\n\nThese aren't AI-specific patterns; they're the same resilience engineering you'd apply to any external dependency. The mistake is treating the LLM as magic instead of as what it is: a slow, occasionally flaky network call to someone else's servers.\n\n## Lever 4: Measure the things that actually move\n\nYou can't optimize what you don't track. From day one, log three numbers per request: **input tokens, output tokens, and end-to-end latency.** Tag them by feature and by model. Within a week you'll have a cost-and-latency breakdown by feature, and it will almost certainly surprise you: there's usually one endpoint quietly responsible for most of the bill, and it's rarely the one you'd guess.\n\nA useful derived metric is **cost per successful user outcome**, not cost per API call. A feature that calls the model twice but actually solves the user's problem is cheaper, in every way that matters, than one that calls it once and gets ignored.\n\n## The mindset shift\n\nThe teams that ship AI features sustainably stop thinking of the model as the product and start thinking of it as an expensive, high-variance dependency they're responsible for managing. The prompt gets you the demo. 
Caching, routing, streaming, and tail control get you a feature you can afford to keep running.\n\nNone of it is exotic. It's the same discipline that makes any external service production-ready — applied to a service that happens to charge by the word and answer at the speed of thought, one token at a time.",
  "title": "How to Put an LLM in Your Product Without Wrecking Your Costs or Your Latency"
}