Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicvigohq2yf3yjwq3hvakdbab25xj254jrtz427rxp74bhs4kjopu",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp7ptieaull2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiepqbtwm6s57y7qzyawjns3ryzhu3ucfnt4tdiyfxtktwsatoumga"
    },
    "mimeType": "image/webp",
    "size": 66918
  },
  "path": "/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80",
  "publishedAt": "2026-06-26T19:15:11.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "llm",
    "opensource",
    "python",
    "supercompress.vercel.app",
    "https://github.com/arjunkshah/supercompress",
    "https://supercompress.vercel.app",
    "https://arjunkshah-supercompress-55.mintlify.app"
  ],
  "textContent": "#  How I Built a Prompt Compressor That Saves 65% on LLM Costs\n\nEvery time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.\n\nI built **SuperCompress** — a tiny ~5K parameter CPU policy that scores every line of context for relevance before inference, keeping only what the model needs.\n\n**The results?** 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.\n\n##  The Problem: LLMs Are Wasteful\n\nModern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are padding — irrelevant boilerplate that consumes KV cache space without contributing to the answer.\n\nThe standard approaches don't work well:\n\nApproach | Tokens Saved | Answer Quality\n---|---|---\nTruncation (keep head/tail) | ~65% | ~25% recall\nFIFO eviction | ~65% | ~25% recall\nH2O | ~65% | ~98% recall\n**SuperCompress** | **~65%** | **100% recall**\n\nAt the same KV savings, SuperCompress preserves answer quality dramatically better.\n\n##  The Architecture: CPU-First Eviction\n\nThe key insight: **you don't need a GPU to decide what a GPU should process.**\n\n\n\n    ┌─────────────┐     ┌──────────────┐     ┌──────────┐\n    │  Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM │\n    │ (1,247 tok)  │     │  (5K params) │     │ (437 tok) │\n    └─────────────┘     └──────────────┘     └──────────┘\n                              │\n                              ↓\n                        Score each line\n                        Drop low-relevance\n                        Keep answer-critical\n\n\nThe policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context + the user's question, and scores how relevant that line is to answering the question. 
Lines below a threshold get evicted.\n\n##  Training Approach\n\nThe policy was trained on a dataset of:\n\n  * Long-form text passages (books, documentation, code)\n  * Paired with realistic user questions\n  * Ground-truth relevance labels from oracle LLM judgments\n\n\n\nThe training objective balances:\n\n  1. **Token savings** — maximize KV reduction\n  2. **Recall** — preserve lines needed for correct answers\n  3. **Latency** — keep inference under 100ms on CPU\n\n\n\n##  Benchmarks\n\nAt a fixed 35% budget (keep 35% of tokens):\n\n\n\n    Policy          | Oracle Recall | Entity Recall | Latency\n    ────────────────┼───────────────┼───────────────┼────────\n    FIFO/Truncation |         25%  |         73%   | ~57ms\n    Summarization   |         61%  |         65%   | ~63ms\n    H2O             |         98%  |         73%   | ~56ms\n    SuperCompress   |        100%  |         73%   | ~60ms\n\n\n100% oracle recall means the policy never dropped a line that the answer depended on. 
It does so at the same compute savings as the other policies.\n\n##  Environmental Impact\n\nPer 1 million compressions:\n\n  * **800M tokens avoided** — that's real GPU time\n  * **29 kWh saved** — enough to power a home for a day\n  * **12 kg CO₂ avoided** — tiny, but it adds up\n  * **52 L water saved** — datacenter cooling is thirsty\n\n\n\n##  Getting Started\n\n###  Python (in-process)\n\n\n    pip install git+https://github.com/arjunkshah/supercompress.git\n\n    from supercompress import compress_context\n\n    result = compress_context(\n        \"Your long context text here...\",\n        \"What does this code do?\",\n        budget_ratio=0.35,\n    )\n    print(result.compressed_text)\n    print(f\"{result.kv_savings_pct:.1f}% KV saved\")\n\n\n###  Hosted API (no local ML deps)\n\n\n    curl -X POST https://supercompress.vercel.app/api/v1/compress \\\n      -H \"X-API-Key: sc_live_YOUR_KEY\" \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\"context\":\"...\",\"query\":\"Summarize this\"}'\n\n\n###  Browser demo (no setup needed)\n\nJust visit supercompress.vercel.app and try the live demo.\n\n##  What's Next\n\n  * Adaptive compression ratios (not fixed budget)\n  * Integration with LangChain/LlamaIndex as a built-in compressor\n  * Quantized policy for even lower latency\n\n\n\nThe code is open source under MIT. Contributions welcome!\n\n**GitHub:** https://github.com/arjunkshah/supercompress\n**Live demo:** https://supercompress.vercel.app\n**Docs:** https://arjunkshah-supercompress-55.mintlify.app",
  "title": "How I Built a Prompt Compressor That Saves 65% on LLM Costs"
}