Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifia6j4vosadydgvw2n6nfnrfvatewefdwndqf5lgehftnxmap5jy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk6ei3cm7442"
  },
  "path": "/t/densemem-folded-ram-tech/175488#post_1",
  "publishedAt": "2026-04-23T15:49:25.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - thorshammerztp-arch/densemem-protocol: DenseMem protocol package · GitHub"
  ],
  "textContent": "Title: I built a KV cache compression protocol — 256x ratio, 0.9994 fidelity, running live on an RTX 4090\n\n* * *\n\nHey r/LocalLLaMA,\n\nI’ve been running a 72B model’s full KV cache in 640MB of DDR5 RAM on my RTX 4090 + Core i9. Wanted to share what I built.\n\n**DenseMem v0.2.0 — FoldedMemory Protocol**\n\nThe problem: a 72B model at 32K context needs ~160GB of KV cache. That’s H100 territory. Most of us can’t touch it.\n\nThe insight: KV cache activations aren’t random. They’re highly structured and correlated. SVD at rank=64 exploits that geometry. The compression is lossy in theory but in practice the fidelity holds at 0.9994 cosine similarity — because real transformer activations live in a low-dimensional subspace.\n\n**Live benchmark (RTX 4090 + Core i9 + DDR5):**\n\n  * Compression: 256x\n  * Fidelity: 0.9994 cosine similarity\n  * Negative control (random noise): 0.12 — confirms it’s exploiting structure, not luck\n  * Avg fetch latency: 1.95ms\n  * Max fetch latency under load: 3.96ms\n  * Evictions: 2,944 clean\n  * 16,384 MB → 63.9 MB live test\n\n\n\n**Architecture:**\n\nTwo-tier hierarchy — VRAM hot, DDR5 warm. Attention-weighted eviction (0.5 attn + 0.3 recency + 0.2 freq). Prefetcher using layer lookahead + sequential token prediction. Two-method API: store() and fetch().\n\n**Current limitation:**\n\nHit rate is 25% — my i9’s 2-channel DDR5 is the bottleneck (~38 GB/s). On Threadripper PRO 8-channel DDR5 (~224 GB/s) I’m projecting 65-75% hit rate with sub-2ms latency.\n\n**Running live:**\n\nQwen2.5-7B at 32K context on a single 4090. Every tick compressed INT8 via PCA into DDR5. Context went from 4K to 32K — 8x expansion via DenseMem.\n\n**Cost:**\n\nUncompressed 72B KV cache at 32K ctx: $32,000 in HBM3e.\nFoldedMemory: $1.88 in DDR5.\n\nGitHub: GitHub - thorshammerztp-arch/densemem-protocol: DenseMem protocol package · GitHub Patent pending (US 64/045,595).\n\nHappy to answer questions on the compression math, architecture, or benchmark methodology.\n\n* * *\n\n_Built by a solo developer / Navy veteran on personal hardware. No funding._",
  "title": "DenseMem-Folded Ram Tech"
}