{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifia6j4vosadydgvw2n6nfnrfvatewefdwndqf5lgehftnxmap5jy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk6ei3cm7442"
},
"path": "/t/densemem-folded-ram-tech/175488#post_1",
"publishedAt": "2026-04-23T15:49:25.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub - thorshammerztp-arch/densemem-protocol: DenseMem protocol package · GitHub"
],
"textContent": "Title: I built a KV cache compression protocol: 256x ratio, 0.9994 fidelity, running live on an RTX 4090\n\n* * *\n\nHey r/LocalLLaMA,\n\nI’ve been running a 72B model’s full KV cache in 640MB of DDR5 RAM on my RTX 4090 + Core i9. Wanted to share what I built.\n\n**DenseMem v0.2.0: FoldedMemory Protocol**\n\nThe problem: a 72B model at 32K context needs ~160GB of KV cache. That’s H100 territory. Most of us can’t touch it.\n\nThe insight: KV cache activations aren’t random. They’re highly structured and correlated. SVD at rank=64 exploits that geometry. The compression is lossy, but in practice fidelity holds at 0.9994 cosine similarity because real transformer activations live in a low-dimensional subspace.\n\n**Live benchmark (RTX 4090 + Core i9 + DDR5):**\n\n * Compression: 256x\n * Fidelity: 0.9994 cosine similarity\n * Negative control (random noise): 0.12, which confirms it’s exploiting structure, not luck\n * Avg fetch latency: 1.95ms\n * Max fetch latency under load: 3.96ms\n * Evictions: 2,944 clean\n * Live test: 16,384 MB → 63.9 MB\n\n**Architecture:**\n\nTwo-tier hierarchy: VRAM hot, DDR5 warm. Attention-weighted eviction (0.5 attn + 0.3 recency + 0.2 freq). Prefetcher using layer lookahead + sequential token prediction. Two-method API: store() and fetch().\n\n**Current limitation:**\n\nHit rate is 25%; my i9’s 2-channel DDR5 is the bottleneck (~38 GB/s). On Threadripper PRO 8-channel DDR5 (~224 GB/s) I’m projecting 65-75% hit rate with sub-2ms latency.\n\n**Running live:**\n\nQwen2.5-7B at 32K context on a single 4090. Every tick is compressed to INT8 via PCA and stored in DDR5. 
Context went from 4K to 32K, an 8x expansion via DenseMem.\n\n**Cost:**\n\nUncompressed 72B KV cache at 32K ctx: $32,000 in HBM3e.\nFoldedMemory: $1.88 in DDR5.\n\nGitHub: thorshammerztp-arch/densemem-protocol\n\nPatent pending (US 64/045,595).\n\nHappy to answer questions on the compression math, architecture, or benchmark methodology.\n\n* * *\n\n_Built by a solo developer / Navy veteran on personal hardware. No funding._",
"title": "DenseMem-Folded Ram Tech"
}