Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifa5ioca7i7qw44iptprv6oze4y6rhhb5h3wfzydt22uqzmsyzttq",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moruocnncn22"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreif5p27ews47xxyxn5sntikbspg5e3s3aeg4u2227ywokshmdhaxtm"
    },
    "mimeType": "image/webp",
    "size": 88656
  },
  "path": "/machinecodingmaster/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and-pgvector-2n1o",
  "publishedAt": "2026-06-21T07:17:35.000Z",
  "site": "https://dev.to",
  "tags": [
    "java",
    "ai",
    "llm",
    "systemdesign",
    "javalld.com"
  ],
  "textContent": "## Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector\n\nYour enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your Redis cache misses when a user asks \"How do I reset my password?\" instead of \"Password reset steps.\" In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget.\n\n## Why Most Developers Get This Wrong\n\n  * **Exact-Match Obsession:** Keying a Redis or Memcached cache on the raw prompt string, which misses semantically equivalent queries that are worded differently.\n  * **Database Abuse:** Hand-rolling vector math inside the application layer instead of letting `pgvector` perform native, hardware-accelerated cosine distance queries.\n  * **Network Bloat:** Calling external APIs (like OpenAI) to embed the user's query _before_ checking the cache, which adds a network round trip to every lookup and defeats the low-latency purpose of caching.\n\n## The Right Way\n\nIntercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.\n\n  * **Use Spring AI Advisors:** Implement a custom `CallAroundAdvisor` to transparently intercept prompts before they hit the external LLM provider.\n  * **Local Embeddings:** Use a local ONNX model (like `all-MiniLM-L6-v2`) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.\n  * **Cosine Similarity Thresholding:** Query PostgreSQL using `pgvector` with an HNSW index, keeping only matches above a strict cosine similarity threshold (e.g., `> 0.96`, which corresponds to a cosine distance below `0.04`).\n\n## Show Me The Code\n\nHere is a sketch of a reusable semantic cache advisor using Spring AI:\n\n    // Sketch against the Spring AI 1.0.0 milestone advisor API; accessor names differ in later releases.\n    public class SemanticCacheAdvisor implements CallAroundAdvisor {\n        private static final String CACHED_RESPONSE = \"cached_response\";\n        private final PgVectorStore vectorStore;\n        private final double similarityThreshold = 0.96;\n\n        public SemanticCacheAdvisor(PgVectorStore vectorStore) {\n            this.vectorStore = vectorStore;\n        }\n\n        @Override\n        public String getName() { return \"SemanticCacheAdvisor\"; }\n\n        @Override\n        public int getOrder() { return 0; }\n\n        @Override\n        public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {\n            String query = request.userText();\n            var matches = vectorStore.similaritySearch(\n                SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)\n            );\n            if (!matches.isEmpty()) {\n                // Cache hit: rebuild a response from the stored answer and skip the LLM call.\n                String cached = matches.get(0).getMetadata().get(CACHED_RESPONSE).toString();\n                var generation = new Generation(new AssistantMessage(cached));\n                return new AdvisedResponse(new ChatResponse(List.of(generation)), request.adviseContext());\n            }\n            // Cache miss: call the model, then store the answer text alongside the embedded query.\n            AdvisedResponse response = chain.nextAroundCall(request);\n            String answer = response.response().getResult().getOutput().getContent();\n            vectorStore.add(List.of(new Document(query, Map.of(CACHED_RESPONSE, answer))));\n            return response;\n        }\n    }\n\n## Key Takeaways\n\n  * **Decouple Caching:** Keep your business logic clean; use Spring AI's `Advisor` chain to handle semantic caching transparently without polluting your services.\n  * **Index for Scale:** Always create an HNSW index on your `pgvector` columns to maintain sub-10ms query times as your cache grows to millions of rows.\n  * **Set Strict Thresholds:** Keep your similarity threshold high (0.95+) to prevent \"hallucinated\" cache hits where distinct user intents are incorrectly matched.\n\n> I built javalld.com while prepping for senior roles: complete LLD problems with execution traces, not just theory.",
  "title": "Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector"
}