Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibnkx33di7j5gp4h75m6phuzbcmkcfpofrk2xvwkssc6f74qluuwq",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mouzdlkr7yj2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreichuv3udgwo24zswyy5izm7hemgephd73mtvgnvlinq6sq22yfqeq"
    },
    "mimeType": "image/webp",
    "size": 73836
  },
  "path": "/claire_nguyen/semantic-caching-our-flaky-test-summariser-58-fewer-llm-calls-3c53",
  "publishedAt": "2026-06-22T13:22:33.000Z",
  "site": "https://dev.to",
  "tags": [
    "sre",
    "devops",
    "llm",
    "mlops",
    "semantic caching",
    "same OpenAI-compatible API",
    "fallback",
    "Bifrost semantic caching docs",
    "Retries and fallbacks",
    "Drop-in replacement guide",
    "Bifrost on GitHub",
    "Gateway setup"
  ],
  "textContent": "**TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most were near-duplicates of failures we'd already explained. Switching on semantic caching in Bifrost cut live provider calls by 58% and dropped p50 latency on cache hits from ~900ms to about 40ms. It also kept the feature alive when our primary provider browned out for 11 minutes.**\n\n##  The feature that wouldn't shut up\n\nOn our platform team (eight of us) we shipped a small thing last quarter: when a test goes flaky in a Buildkite pipeline, we pass the failure output to an LLM and stick a plain-English summary on the build page. Devs liked it. The provider bill less so.\n\nBy March it was making roughly 40,000 calls a day against `anthropic/claude-haiku`, with `openai/gpt-4o-mini` as the fallback. p50 latency sat around 900ms. The monthly bill crept past $310. Not catastrophic. But the calls were doing the same work over and over.\n\n##  Why the calls were so repetitive\n\nHere's the bit that bugged me. Flaky tests are flaky for the same reasons across builds. A timeout in `payments_spec.rb` looks almost the same on Tuesday as it did on Monday, minus a timestamp and a container ID.\n\nSo we were paying full freight to summarise text we'd already summarised. Different bytes, same meaning. A normal key-based cache misses all of these because the strings never match exactly. That's the whole problem semantic caching solves: it matches on meaning, not on an MD5 hash of the prompt.\n\nWe already ran everything through Bifrost as our gateway, mostly for the automatic failover. Turns out the semantic caching was sitting right there.\n\n##  Turning it on\n\nBifrost runs as a single Go binary in front of our summariser. 
We added the cache plugin to the gateway config and pointed it at a small embedding model so we weren't paying much per lookup.\n\n\n\n    {\n      \"plugins\": [\n        {\n          \"name\": \"semantic_cache\",\n          \"config\": {\n            \"embedding_model\": \"openai/text-embedding-3-small\",\n            \"threshold\": 0.92,\n            \"ttl_seconds\": 86400\n          }\n        }\n      ]\n    }\n\n\nThe `threshold` is the knob that matters. At a cosine similarity of 0.92, two failures have to be genuinely close before we serve a cached summary. We started at 0.97, which was too strict (hit rate sat around 20%), and walked it down while spot-checking summaries against the real failures.\n\nSettled on 0.92. Cache hit rate landed at 58% over the first three weeks. On a hit, the summariser returns in ~40ms instead of waiting on a provider round trip. No code change in our app, since Bifrost speaks the same OpenAI-compatible API we already called.\n\n##  What the brownout taught us\n\nTwo weeks in, our primary provider had a rough afternoon. Elevated errors and timeouts for 11 minutes. Normally Bifrost's fallback kicks the traffic to the secondary, which it did.\n\nBut the cache did something I hadn't planned for. More than half the requests during that window never reached either provider, because they matched recent failures already in the cache. The blast radius shrank on its own. The fallback handled the genuinely new failures, the cache absorbed the repeats, and nobody filed a ticket. She'll be right, basically.\n\nThat's the reliability angle people miss with caching. It's not only a cost lever. It's load shedding you get for free when an upstream goes wobbly.\n\n##  Bifrost vs LiteLLM vs Portkey\n\nWe looked at the obvious alternatives before committing. All three can do semantic caching. 
They're not the same.\n\nCapability | Bifrost | LiteLLM | Portkey\n---|---|---|---\nSemantic cache | Built in, config-driven | Yes, via Redis + embeddings | Yes, mature\nFailover + cache together | Single binary | Proxy + Redis to wire up | SaaS, polished\nSelf-host | Go binary, Docker | Python proxy | Self-host or cloud\nDashboard | Built-in web UI | Community UI | Strongest of the three\nProvider breadth | 23+ | Very broad | Broad\n\nHonest read: LiteLLM has the bigger community and the widest provider list, and if you already run Redis their cache is well-trodden. Portkey's dashboard and analytics are the slickest of the lot, and for a team that wants a managed SaaS it's hard to argue against.\n\nWe picked Bifrost because we self-host on ECS and wanted the failover and the cache in one Go process, not a Python proxy plus a Redis we'd have to babysit. Fewer moving parts to break on a game day.\n\n##  Trade-offs and Limitations\n\nSemantic caching isn't free of sharp edges, and pretending otherwise would be daft.\n\nThe threshold is a real risk. Set it too loose and you'll serve a summary from a _different_ failure that happens to read similarly. We caught two of these at 0.88 during tuning. A bad summary on a build page erodes trust fast, so we erred conservative at 0.92 and accept a lower hit rate for it.\n\nEmbeddings add a little latency and cost on every lookup, including misses. With `text-embedding-3-small` it's small, but it's not zero. For workloads where every input is genuinely unique, you'll pay the embedding tax and get almost nothing back.\n\nCache invalidation is on you. When we changed the summariser's prompt, every cached entry was suddenly stale against the new format. We dropped the TTL to 24 hours so the cache rolls over daily rather than holding stale shapes for a week.\n\nAnd it doesn't replace failover. The cache helped during the brownout, but only because we had recent traffic. Cold cache plus dead provider equals a bad time. 
Keep your fallback chain regardless.\n\n##  Further Reading\n\n  * Bifrost semantic caching docs\n  * Retries and fallbacks\n  * Drop-in replacement guide\n  * Bifrost on GitHub\n  * Gateway setup\n\n",
  "title": "Semantic caching our flaky-test summariser: 58% fewer LLM calls"
}