{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifa5ioca7i7qw44iptprv6oze4y6rhhb5h3wfzydt22uqzmsyzttq",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moruocnncn22"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreif5p27ews47xxyxn5sntikbspg5e3s3aeg4u2227ywokshmdhaxtm"
},
"mimeType": "image/webp",
"size": 88656
},
"path": "/machinecodingmaster/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and-pgvector-2n1o",
"publishedAt": "2026-06-21T07:17:35.000Z",
"site": "https://dev.to",
"tags": [
"java",
"ai",
"llm",
"systemdesign"
],
"textContent": "## Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector\n\nYour enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your exact-match Redis cache misses when a user asks \"How do I reset my password?\" and the cached key is \"Password reset steps.\" In 2026, relying on exact-string matching for LLM caching is a rookie mistake that hurts both your latency and your budget.\n\n## Why Most Developers Get This Wrong\n\n * **Exact-Match Obsession:** Using traditional Redis or Memcached key-value pairs, which miss queries that mean the same thing but are worded differently.\n * **Database Abuse:** Hand-rolling vector math inside the application layer instead of letting `pgvector` run native, index-backed cosine distance queries (the `<=>` operator) inside PostgreSQL.\n * **Network Bloat:** Calling external APIs (like OpenAI) to embed the user's query _before_ checking the cache, which adds a network round trip to every lookup and undercuts the low-latency purpose of caching.\n\n\n\n## The Right Way\n\nIntercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.\n\n * **Use Spring AI Advisors:** Implement a custom `CallAroundAdvisor` to transparently intercept prompts before they hit the external LLM provider.\n * **Local Embeddings:** Use a local ONNX model (like `all-MiniLM-L6-v2`) inside your JVM process to generate query embeddings, typically in under 5ms, avoiding external network hops.\n * **Cosine Similarity Thresholding:** Query PostgreSQL using `pgvector` with an HNSW index, keeping only matches whose cosine similarity exceeds a strict threshold (e.g., `> 0.96`, which is a cosine distance below `0.04`).\n\n\n\n## Show Me The Code\n\nHere is how to implement a reusable semantic cache advisor using Spring AI. The vector store embeds the query with whatever `EmbeddingModel` it was configured with, so wire the local ONNX model in there:\n\n\n\n    // Assumes the Spring AI 1.0.0 milestone advisor API (CallAroundAdvisor, AdvisedRequest);\n    // later releases rename these types, so adapt the signatures to your version.\n    public class SemanticCacheAdvisor implements CallAroundAdvisor {\n        private final PgVectorStore vectorStore;\n        private final double similarityThreshold = 0.96;\n\n        public SemanticCacheAdvisor(PgVectorStore vectorStore) {\n            this.vectorStore = vectorStore;\n        }\n\n        @Override\n        public String getName() { return \"SemanticCacheAdvisor\"; }\n\n        @Override\n        public int getOrder() { return 0; }\n\n        @Override\n        public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {\n            String query = request.userText();\n            var matches = vectorStore.similaritySearch(\n                SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1));\n            if (!matches.isEmpty()) {\n                // Cache hit: wrap the stored answer in a ChatResponse without calling the LLM.\n                String cached = matches.get(0).getMetadata().get(\"cached_response\").toString();\n                var chatResponse = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));\n                return new AdvisedResponse(chatResponse, request.adviseContext());\n            }\n            // Cache miss: call the model, then store the query with its answer as metadata.\n            AdvisedResponse response = chain.nextAroundCall(request);\n            String answer = response.response().getResult().getOutput().getContent();\n            vectorStore.add(List.of(new Document(query, Map.of(\"cached_response\", answer))));\n            return response;\n        }\n    }\n\n\n## Key Takeaways\n\n * **Decouple Caching:** Keep your business logic clean; use Spring AI's `Advisor` chain to handle semantic caching transparently without polluting your services.\n * **Index for Scale:** Always create an HNSW index on your `pgvector` columns so that lookups can stay in the sub-10ms range as your cache grows to millions of rows.\n * **Set Strict Thresholds:** Keep your similarity threshold high (0.95+) to prevent false cache hits, where distinct user intents are incorrectly matched and one user gets another question's answer.\n\n\n\n> I built javalld.com while prepping for senior roles: complete LLD problems with execution traces, not just theory.",
"title": "Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector"
}