Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihjvsewccqzxg3otp4asrhl4eeugbhjqn7bzou4kq4zjr5i3xq5v4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo44slrrtjn2"
  },
  "path": "/t/shannon-prime-lattice/176466#post_13",
  "publishedAt": "2026-06-12T14:36:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "@32k"
  ],
  "textContent": "# Prime-Engine - Updates\n\n## **2. Current status — honest table**\n\n**Update 2026-06-10 — gemma4 tokenizer dispatch ( #115) CLOSED; 12B text-in LIVE.** New tokenizer module `src/tokenizer/gemma4_bpe.c` + family dispatch (`GEMMA4_BPE` family tag written into `.sp-tokenizer` by `sp_transcode`; unknown family = hard error). Gates `T_G4_TOK_PARITY` + `T_G4_TOK_ROUNDTRIP`: **5432/5432 HF-parity exact, both lanes** (GGUF lane + `.sp-tokenizer` blob lane; engine `3253a82`, core `9d3cc72`). Deployment (engine `d8ba947`): the installed 12B blobs were regenerated via `sp_transcode --tok-only` and each paired `.sp-model` header SHA re-paired the way `sp_transcode` pairs at creation; new gate `T_G4_TOK_12B_PAIRED` (proven sensitive — a legacy type_id=2 blob fails it 0/5432); B1 GPU decode smoke 6/6 on the 2060. Also: the E_CPU_9 byte-identity lanes now pin `SP_CPU_SCALAR=1` (the common AVX2 dot kernel reassociates; engine `5cd5870`), and the submodule carries core `64b698c` — the **`sp_arm_*_geom` per-layer-class router API** (`T_ARM_GEOM` 26/26), the G-P3-GEOM substrate for the gemma4 ring port.\n\n**Update 2026-06-08 — the gemma-4 campaign closed; the sovereign quantization pipeline ships here.** `sp_transcode` gained **Safetensors Direct** (`--st <model.safetensors>`: weight VALUES from the official checkpoint; GGUF supplies verified-clean metadata/tokenizer only; mapped-but-missing = hard error) and the **OK_Q4B** codec (`--q4b` / `--q4b-ffn` recipe B1: per-32-block f16 scales, store-then-derive). The CUDA backend gained `k_gemv_q4b_dp4a_v2` (per-block scale inside the dp4a chunk loop) + `k_dequant_arena_q4b` + `DevTensor.bscale` routing; the core arena moved to layout v2 (formal migration). Result, gated 24/24 on the RTX 2060 12GB: **Gemma-4-12B at 26.1 tok/s and wikitext PPL 5.12** (GPU PPL gate 5.1160 vs the hand-written gold reference 4.6776; sim/CPU/GPU triple-agreement at 5.1259/5.1259/5.1160). 
Context: every gemma-4 GGUF measurable in June 2026 carries broken weights (192–506 by engine-independent measurement) — see the public repo’s `GEMMA4-QUANT-FIX.md`. The earlier 34.2 tok/s headline is retired (its artifact failed the PPL gate).\n\n**Update 2026-06-06.** New since the snapshot below: the engine drives the canonical math-core decode at engine speed via the `cpu_overlay.c` dispatch seam (the duplicate decode was deleted); AVX2 `sp_pr_resdot` + `sp_ntt_fwd_batch` (lanes=heads) + AVX512-VPOPCNTDQ `sp_arm_scan_sig` overrides; the dual-size + **split-device** Optane Ring-2 store (`ring2_arm_backend.c`, `SP_RING2_OPTANE_DIR_V`) with `read_batch2` concurrent dual-queue fetch and a bounded LRU temporal staging cache (`SP_RING2_CACHE_MB`); the QUIC Ring-2 peer + two-process showpiece (`sp_ring2_showpiece`). **CUDA backend (RTX 2060 sm_75): gated on real silicon** — prefill `qwen3_forward_cuda` f32+Q8 argmax-exact, and a NEW autoregressive **`qwen3_decode_cuda`** (KV resident in VRAM, device argmax; gate `M_QWEN3_DECODE_CUDA`) generating at 6.93→11.97 tok/s (Q8). Detail in the lattice `papers/PPT-LAT-Roadmap.md` §21 + `SESSION-CLOSED-stage-beta-s0.md`.\n\n# Prime System Updates\n\n## **2. Current status**\n\n**Update 2026-06-10.** New since 06-08: the **per-layer-class geom API** (`sp_arm_*_geom`, commits `d118a92` + `64b698c` — gate `T_ARM_GEOM` 26/26, uniform-null bit-identical, legacy entry points delegate to the geom bodies; the G-P3-GEOM substrate for the gemma4 ring port). **C1-lite COMPLETE** (tag `xbar-c1-lite-complete`): the transactional curator core + episode persistence / router re-projection (`tools/curator/`), the `SP_REPLAY` replay-decode seam in `decode.c` (`T_GENKV_REPLAY_NULL` 34/34), recall-hit telemetry (`sp_arm_hits_*`) and cold-evict consolidation (`T_GENKV_COLD_EVICT` 45/45). Gemma4 tokenizer dispatch, core side (`SP_TOK_GEMMA4_BPE` family tag + vocab-only GGUF open fix, `9d3cc72`; the engine carries the `gemma4_bpe` module + gates). 
P3 pre-flight hardening: gemma4 shared-KV owner-index bounds guard + standalone frobenius link fix + `T_FRO_5` aligned to arena layout v2 (`c608b2f`). The two-ring / replay / cold-evict gate harness lives in `core/session/arm_genkv_gate.c`.\n\n**Update 2026-06-08.** The packed-weight arena moved to **layout v2** (formal migration: `core/arena/arena.c` pin + `sp/frobenius_lift.h` v2 note): the descriptor gains optional per-32-block f16 scales (`bscale`/`bs_nblk`) for the **OK_Q4B** codec — `bscale == NULL` preserves v1 semantics exactly (all producers audited zero-init). Migrated consumers: `sp_frob_packed_dequant_row` + `matmul_arena` per-block paths; new bridge builder `build_packed_q4b` (`.bscale` sibling; dtypes 13/14 in `sp/sp_model.h`). This format carries the gemma-4-12B sovereign artifact (GPU-gated PPL 5.12 vs the gold reference 4.6776 — lattice CONTRACT-SPEED + the public `GEMMA4-QUANT-FIX.md`).\n\n**Update 2026-06-06.** The math-core now owns the canonical two-ring KV decode (`core/arm/` + `core/forward/decode.c` — the single-source `qwen3_generate_kv`/`qwen3_ppl_decode`). New since the snapshot below: dual-prime NTT keystore fusion (`core/poly_ring/`, write-once residue cache, exact `<q,k>` via residue dot + Garner — bit-exact to the scalar reference, gates `T_PR_KSTORE`/`_BLUE`/`_RESDOT`/`_BATCH`); the bit-packed popcount router (`sp_arm_project_sig`/`select_sig` + the `sp_arm_scan_sig` AVX512-overridable seam, gate `T_ARM_SIG`); GQA group-centroid kv-head selection; the batched forward-NTT seam (`sp_ntt_fwd_batch` + `ntt_fwd_plan` view). The Ring-2 abstract backend gained `read_batch2` (mixed-stream concurrent fetch). All overlay knobs are off-by-default and bit-identical when off. Suite 22/22. 
Detail in the lattice `papers/CONTRACT-C2`/`CONTRACT-SPEED`.\n\n**GPU acceleration of this core’s packed arena (engine-side, 2026-06-06).** The math-core’s Q8/Q4 packed-weight arena (`core/frobenius/` + the codec) is now consumed directly on the GPU by a fused `__dp4a` GEMV in the engine’s CUDA decode — 1 byte/weight (Q8) / 0.5 byte/weight (Q4) straight from VRAM, no f32 dequant. Isolated on an RTX 2060 at 12B-scale dims: **f32 1× (bus-saturated ~290 GB/s) → int8 ~3.8× → Q4 ~7.06×**, all top-1-lossless vs the core’s dequant reference. This is the discrete-substrate payoff at deployment scale: the packed weights aren’t just smaller on disk, they’re ~7× faster to _compute_ where the memory bus binds. See `shannon-prime-system-engine` README §5.2.1 + lattice `SESSION-CLOSED-stage-beta-speed.md`. GPU benchmarking discipline is in `CONVENTIONS.md`.\n\nSync discipline: this repo is ALSO carried as the engine’s `lib/shannon-prime-system` submodule, so the two checkouts can diverge. Both track the same `origin/main`; `git fetch` + behind-check before any build or commit (see `CLAUDE.md`), and every standalone commit is followed by a submodule bump in the engine.\n\n**Headline (what the math-core now proves).** The discrete forward is bit-exact on **5 arch families** (through the 35B-A3B Gated-DeltaNet MoE); the reducing `.sp-model` codec is **output-lossless and smaller than source** (C1); the NTT-CRT / Frobenius / Spinor / KSTE primitives are all shipped + gated. These primitives feed the engine’s measured envelope — the two-ring memory (910× @32k, 7.57 µs/read off Optane) and the WIRE-CPU integer pipe (0.84 → 39.52 tok/s, 47×) are realized in the engine repo on top of this core. The open headline remains the **Spinor per-vector KV codec ratio at bit-exact** (lossy 29/31 today) — see `shannon-prime-lattice/papers/PPT-LAT-STATE.md`.",
  "title": "Shannon Prime Lattice"
}