Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidkfygosigjnqexotpmbtq6tgrwjfvzcl5xhuvb2lbzrtbqtsrdlq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mokm6tmwwf72"
  },
  "path": "/t/shannon-prime-lattice/176466?page=2#post_24",
  "publishedAt": "2026-06-18T09:06:40.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "## Prime-Engine + Prime-System - Update 2026-06-12 (XBAR P3: the KV cache learns to live on disk)\n\n#### **Quick frame for anyone following along: XBAR is the auditable-latent-crossbar line - giving the frozen gemma-4-12B Exec a disk-backed memory tier (Ring-2) so its KV history doesn’t all have to sit in VRAM. Every stage below is gated bit-exact against the full-cache baseline on the real 12B, on the RTX 2060-12GB - the same discipline as the rest of this thread: bit-exact is table stakes, the envelope is the headline. Since the 06-10 tokenizer update the whole P3 substrate landed.**\n\n**Read-path - write a memory, save it to disk, reload it, attend over it (all token-identical):**\n\n  * _P3.0 -_ the episode/owner-map manifest, gate G-P3-0: the byte-addressing law (off[L]) for where each layer’s KV lives in the store (system 9a2b0a9).\n  * _P3.1 -_ that off[L] indirection wired into the gemma4 CUDA decode; gate G-P3-1, the recalled sequence token-identical to the legacy live-cache decode (engine cdb4e1d).\n  * _P3.1b-1_ - serialize the store to disk → deserialize → mount → decode == legacy; gate G-P3-1b (engine 7c383bc).\n  * _P3.1b-2 -_ mount a saved episode as prepended history and keep generating over it; gate G-P3-1b-2, continuation == the monolithic decode, diffs=0 (engine b7fc3ee). The read-path is complete: history that was written, saved to disk, and reloaded drives generation with zero drift.\n\n\n\n**Write-path - the inverse, also bit-exact:**\n\n  * **P3.2-a -** shadow spill:* per step, each layer’s freshly-minted K/V is written out to the Ring-2 disk store via the stdio backend; gate G-P3-R2.a reads it back byte-identical to the live cache (0 diffs at both a 5.2 MB and an 85.3 MB store, 48 owners, 0 sharer blocks) (engine 2864614).\n  * **P3.2-b-1 -** the closed loop: spill a position → poison (zero) the live copy → page it back off disk before attention; gate G-P3-R2.b-1, the paged decode token-identical to the full-cache decode. 
The poison is the rigor - the live cache is provably not the source, so this proves the model’s entire history can live on disk and still drive generation exactly (engine b516ec1).\n  * **P3.2-b-2a -** the first stage that actually shrinks VRAM: the 40 sliding-window layers (which carry the dominant cache term) drop from full-length to a fixed W-slot ring, with a position-ordered ring kernel that keeps it bit-exact to the full cache; gate G-P3-R2.b-2a (engine 1a08d3d).\n\n\n\nLattice contracts/state/handoff tracked alongside: e1ae5d9 → 27c7579, session bank 5fa2a9c.\n\n**What it means, briefly.** The 12B’s KV cache no longer has to live entirely in VRAM - it can spill to a disk tier and be recalled mid-generation without changing a single output token, and the sliding-window ring makes the biggest cache term constant in context length instead of linear.\n\n**Honest scope, same as always:** these are bit-exact substrate gates - correctness, not speed. The headline VRAM figure (the arithmetic: the dominant sliding-window term goes from ~21 GB at 32k context down to a constant ~0.67 GB at a 1024 window) is the consequence of the shrink, proven bit-exact at gate scale - it is not yet a measured end-to-end 32k run, and the per-step paging is correctness-first, not perf-tuned. The one term still scaling with context is the 8 global-attention layers. That is the next step, gate G-P3-R2.b-2b, and it deliberately crosses out of bit-exact territory: sparse top-k recall drops keys by definition, so the gate there stops being “diffs=0” and becomes a pre-registered degradation bound (PPL ceiling / retrieval retention). That’s the policy layer, drafted next.\n\n_For the folks in the neighbouring threads chasing “which concepts survive vs. get dropped” - this is the engine-side version of that exact question, just answered as a substrate (what stays resident vs. pages to disk and comes back) before any learned policy sits on top._",
  "title": "Shannon Prime Lattice"
}
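
The fixed W-slot ring described in the post's P3.2-b-2a item, and the VRAM arithmetic in its scope note, can be illustrated with a small sketch. This is not the engine's CUDA ring kernel. It is a plain-Python toy with hypothetical names (`RingCache`, `ring_matches_full`, `ring_term_gb`), assuming a ring slot is chosen as position modulo W and that the sliding-window cache term scales linearly with the number of cached positions. It checks that reading the ring in position order gives the same entries as the last W entries of a full-length cache.

```python
import random


def full_cache_window(kv, window):
    """Reference view: the last `window` entries of a full-length cache."""
    return kv[-window:]


class RingCache:
    """Toy fixed-size ring: position p is stored in slot p % window (assumed layout)."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total positions written so far

    def append(self, entry):
        # Overwrite the oldest slot once the ring is full.
        self.slots[self.count % self.window] = entry
        self.count += 1

    def ordered(self):
        """Return surviving entries in ascending position order."""
        if self.count <= self.window:
            return self.slots[:self.count]
        start = self.count % self.window  # slot holding the oldest survivor
        return self.slots[start:] + self.slots[:start]


def ring_matches_full(num_steps, window, seed=0):
    """Check at every step that the ordered ring equals the full-cache window."""
    rng = random.Random(seed)
    full, ring = [], RingCache(window)
    for _ in range(num_steps):
        entry = rng.random()  # stand-in for one position's K/V
        full.append(entry)
        ring.append(entry)
        if ring.ordered() != full_cache_window(full, window):
            return False
    return True


def ring_term_gb(full_term_gb, context_len, window):
    """Linear-scaling assumption: cache size is proportional to cached positions."""
    return full_term_gb * window / context_len


if __name__ == "__main__":
    print(ring_matches_full(num_steps=50, window=8))
    print(ring_term_gb(21.0, 32768, 1024))
```

Under the linear-scaling assumption, a 1024-slot ring is 1/32 of a 32768-position cache, so a term of roughly 21 GB shrinks to roughly 0.66 GB, in line with the rounded ~0.67 GB figure quoted in the post.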