Shannon Prime Lattice
Prime-Engine - Updates
2. Current status — honest table
Update 2026-06-10: gemma4 tokenizer dispatch (#115) CLOSED; 12B text-in LIVE. New tokenizer module src/tokenizer/gemma4_bpe.c + family dispatch (GEMMA4_BPE family tag written into .sp-tokenizer by sp_transcode; unknown family = hard error). Gates T_G4_TOK_PARITY + T_G4_TOK_ROUNDTRIP: 5432/5432 HF-parity exact, both lanes (GGUF lane + .sp-tokenizer blob lane; engine 3253a82, core 9d3cc72). Deployment (engine d8ba947): the installed 12B blobs were regenerated via sp_transcode --tok-only, and each paired .sp-model header SHA was re-paired the way sp_transcode pairs at creation; new gate T_G4_TOK_12B_PAIRED (proven sensitive: a legacy type_id=2 blob fails it 0/5432); B1 GPU decode smoke 6/6 on the 2060. Also: the E_CPU_9 byte-identity lanes now pin SP_CPU_SCALAR=1 (the common AVX2 dot kernel reassociates; engine 5cd5870), and the submodule carries core 64b698c, which brings the sp_arm_*_geom per-layer-class router API (T_ARM_GEOM 26/26), the G-P3-GEOM substrate for the gemma4 ring port.
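Why a byte-identity lane would pin the scalar path: float addition is not associative, so a kernel that sums in SIMD lanes can round differently from a left-to-right loop on the same data. The sketch below is a minimal illustration in plain C, not the engine's kernel; the 8-lane striping only mimics the accumulation order of a 256-bit register, and the function names and demo inputs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* 8 float lanes, as in a 256-bit SIMD register */

/* Left-to-right accumulation: the order a scalar reference loop uses. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += a[i] * b[i];
    return acc;
}

/* Lane-striped accumulation: element i is added to partial sum i % LANES,
 * then the partial sums are reduced. No intrinsics; only the order changes. */
static float dot_striped(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float lane[LANES] = {0.0f};
    for (size_t i = 0; i < n; i++) lane[i % LANES] += a[i] * b[i];
    float acc = 0.0f;
    for (size_t l = 0; l < LANES; l++) acc += lane[l];
    return acc;
}

/* Hypothetical demo input: a huge pair that cancels, plus two small terms. */
static void demo_fill(float a[16], float b[16]) {
    for (size_t i = 0; i < 16; i++) { a[i] = 0.0f; b[i] = 1.0f; }
    a[0] = 1e30f;  a[1] = 1.0f;
    a[8] = -1e30f; a[9] = 1.0f;
}

float demo_scalar(void) {
    float a[16], b[16];
    demo_fill(a, b);
    return dot_scalar(a, b, 16);
}

float demo_striped(void) {
    float a[16], b[16];
    demo_fill(a, b);
    return dot_striped(a, b, 16);
}
```

On this input demo_scalar() returns 1.0f (the first small term is absorbed by 1e30 before the cancellation) and demo_striped() returns 2.0f (the large pair cancels inside one lane, so both small terms survive). Either answer is a legitimate float sum, which is why a bit-exact comparison has to fix one order.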
Update 2026-06-08 — the gemma-4 campaign closed; the sovereign quantization pipeline ships here. sp_transcode gained Safetensors Direct (--st <model.safetensors>: weight VALUES from the official checkpoint; GGUF supplies verified-clean metadata/tokenizer only; mapped-but-missing = hard error) and the OK_Q4B codec (--q4b / --q4b-ffn recipe B1: per-32-block f16 scales, store-then-derive). The CUDA backend gained k_gemv_q4b_dp4a_v2 (per-block scale inside the dp4a chunk loop) + k_dequant_arena_q4b + DevTensor.bscale routing; the core arena moved to layout v2 (formal migration). Result, gated 24/24 on the RTX 2060 12GB: Gemma-4-12B at 26.1 tok/s and wikitext PPL 5.12 (GPU PPL gate 5.1160 vs the hand-written gold reference 4.6776; sim/CPU/GPU triple-agreement at 5.1259/5.1259/5.1160). Context: every gemma-4 GGUF measurable in June 2026 carries broken weights (192–506 by engine-independent measurement) — see the public repo’s GEMMA4-QUANT-FIX.md. The earlier 34.2 tok/s headline is retired (its artifact failed the PPL gate).
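To make the per-32-block scale idea concrete, here is a minimal sketch of generic symmetric absmax 4-bit quantization, comparing one scale per row against one scale per 32-weight block. It is not the OK_Q4B recipe: the rounding rule, the [-7, 7] code range, the float (rather than f16) scales, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Quantize x[0..n) in blocks of `blk` weights, one scale per block,
 * codes in [-7, 7]. Returns the largest absolute reconstruction error. */
static float quant_max_err(const float *x, size_t n, size_t blk) {
    assert(blk > 0);
    float worst = 0.0f;
    for (size_t b0 = 0; b0 < n; b0 += blk) {
        size_t len = (n - b0 < blk) ? (n - b0) : blk;
        float amax = 0.0f;
        for (size_t i = 0; i < len; i++)
            if (absf(x[b0 + i]) > amax) amax = absf(x[b0 + i]);
        float scale = amax / 7.0f;  /* the block's absmax maps to code 7 */
        for (size_t i = 0; i < len; i++) {
            float v = x[b0 + i];
            int q = 0;
            if (scale > 0.0f) {
                float r = v / scale;
                q = (int)(r >= 0.0f ? r + 0.5f : r - 0.5f);  /* round to nearest */
                if (q > 7) q = 7;
                if (q < -7) q = -7;
            }
            float err = absf(v - (float)q * scale);
            if (err > worst) worst = err;
        }
    }
    return worst;
}

/* Hypothetical row: 32 small weights, then a block holding one outlier. */
static void demo_row(float x[64]) {
    for (size_t i = 0; i < 32; i++) x[i] = 0.01f * (float)(i + 1);
    for (size_t i = 32; i < 64; i++) x[i] = 0.0f;
    x[32] = 7.0f;
}

float demo_err_per_row(void) {
    float x[64];
    demo_row(x);
    return quant_max_err(x, 64, 64);  /* one scale for the whole row */
}

float demo_err_per_block(void) {
    float x[64];
    demo_row(x);
    return quant_max_err(x, 64, 32);  /* one scale per 32 weights */
}
```

With a single row scale the outlier sets the step size to 1.0 and every small weight collapses to code 0; with per-block scales the first block gets its own, much finer step, so its worst-case error drops by roughly an order of magnitude on this input.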
Update 2026-06-06. New since the snapshot below: the engine drives the canonical math-core decode at engine speed via the cpu_overlay.c dispatch seam (the duplicate decode was deleted); AVX2 sp_pr_resdot + sp_ntt_fwd_batch (lanes=heads) + AVX512-VPOPCNTDQ sp_arm_scan_sig overrides; the dual-size + split-device Optane Ring-2 store (ring2_arm_backend.c, SP_RING2_OPTANE_DIR_V) with read_batch2 concurrent dual-queue fetch and a bounded LRU temporal staging cache (SP_RING2_CACHE_MB); the QUIC Ring-2 peer + two-process showpiece (sp_ring2_showpiece). CUDA backend (RTX 2060 sm_75): gated on real silicon — prefill qwen3_forward_cuda f32+Q8 argmax-exact, and a NEW autoregressive qwen3_decode_cuda (KV resident in VRAM, device argmax; gate M_QWEN3_DECODE_CUDA) generating at 6.93→11.97 tok/s (Q8). Detail in the lattice papers/PPT-LAT-Roadmap.md §21 + SESSION-CLOSED-stage-beta-s0.md.
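One plausible shape for an ISA-overridable scan seam is a function pointer with a portable default, which an AVX512-VPOPCNTDQ build could repoint at a vectorized body. The sketch below assumes the scan picks the signature with the smallest Hamming distance to a query; that selection rule, the 64-bit signature width, and every name here are hypothetical and not taken from sp_arm_scan_sig.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Seam type: any implementation with this signature can be plugged in. */
typedef size_t (*scan_fn)(const uint64_t *sigs, size_t n, uint64_t query);

/* Portable popcount: clears the lowest set bit once per iteration. */
static unsigned popcount64(uint64_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

/* Default body: index of the signature nearest to `query` in Hamming distance. */
static size_t scan_portable(const uint64_t *sigs, size_t n, uint64_t query) {
    size_t best = 0;
    unsigned best_d = 65;  /* larger than any possible 64-bit distance */
    for (size_t i = 0; i < n; i++) {
        unsigned d = popcount64(sigs[i] ^ query);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

/* The seam itself: an ISA-specific build could reassign this pointer. */
static scan_fn scan_impl = scan_portable;

static size_t scan_sig(const uint64_t *sigs, size_t n, uint64_t query) {
    assert(n > 0);
    return scan_impl(sigs, n, query);
}

size_t demo_scan(void) {
    const uint64_t sigs[3] = {0xFFFFULL, 0x00F0ULL, 0xF0F0F0F0ULL};
    return scan_sig(sigs, 3, 0x00F1ULL);  /* distances: 11, 1, 13 */
}
```

Because callers only see scan_sig, a faster body can be swapped in and checked against scan_portable on the same inputs, which is the kind of comparison a bit-identical gate needs.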
Prime System Updates
2. Current status
Update 2026-06-10. New since 06-08: the per-layer-class geom API (sp_arm_*_geom, commits d118a92 + 64b698c — gate T_ARM_GEOM 26/26, uniform-null bit-identical, legacy entry points delegate to the geom bodies; the G-P3-GEOM substrate for the gemma4 ring port). C1-lite COMPLETE (tag xbar-c1-lite-complete): the transactional curator core + episode persistence / router re-projection (tools/curator/), the SP_REPLAY replay-decode seam in decode.c (T_GENKV_REPLAY_NULL 34/34), recall-hit telemetry (sp_arm_hits_*) and cold-evict consolidation (T_GENKV_COLD_EVICT 45/45). Gemma4 tokenizer dispatch, core side (SP_TOK_GEMMA4_BPE family tag + vocab-only GGUF open fix, 9d3cc72; the engine carries the gemma4_bpe module + gates). P3 pre-flight hardening: gemma4 shared-KV owner-index bounds guard + standalone frobenius link fix + T_FRO_5 aligned to arena layout v2 (c608b2f). The two-ring / replay / cold-evict gate harness lives in core/session/arm_genkv_gate.c.
Update 2026-06-08. The packed-weight arena moved to layout v2 (formal migration: core/arena/arena.c pin + sp/frobenius_lift.h v2 note): the descriptor gains optional per-32-block f16 scales (bscale/bs_nblk) for the OK_Q4B codec — bscale == NULL preserves v1 semantics exactly (all producers audited zero-init). Migrated consumers: sp_frob_packed_dequant_row + matmul_arena per-block paths; new bridge builder build_packed_q4b (.bscale sibling; dtypes 13/14 in sp/sp_model.h). This format carries the gemma-4-12B sovereign artifact (GPU-gated PPL 5.12 vs the gold reference 4.6776 — lattice CONTRACT-SPEED + the public GEMMA4-QUANT-FIX.md).
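The backward-compatibility rule (a NULL bscale keeps the old behavior) can be sketched as follows. The struct is hypothetical and is not the descriptor in sp/frobenius_lift.h: it assumes v1 means a single per-row scale, stores one unpacked int8 code per weight, and uses float where the text says f16. Only the field names bscale and bs_nblk come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLK 32  /* weights covered by one per-block scale */

typedef struct {
    const int8_t *q;        /* one signed code per weight (unpacked for clarity) */
    size_t        n;        /* weights in the row */
    float         scale;    /* assumed v1 field: one scale for the whole row */
    const float  *bscale;   /* optional per-block scales; NULL selects the v1 path */
    size_t        bs_nblk;  /* number of entries in bscale */
} row_desc;

/* Dequantize one row; falls back to the row scale when bscale is NULL. */
static void dequant_row(const row_desc *d, float *out) {
    for (size_t i = 0; i < d->n; i++) {
        float s = d->scale;
        if (d->bscale != NULL) {
            assert(i / BLK < d->bs_nblk);
            s = d->bscale[i / BLK];
        }
        out[i] = (float)d->q[i] * s;
    }
}

static float demo(int use_bscale) {
    static const float bs[2] = {0.25f, 1.0f};
    int8_t q[64];
    float out[64];
    for (size_t i = 0; i < 64; i++) q[i] = 2;
    row_desc d = {0};  /* zero-init leaves bscale NULL, i.e. the v1 path */
    d.q = q;
    d.n = 64;
    d.scale = 0.5f;
    if (use_bscale) { d.bscale = bs; d.bs_nblk = 2; }
    dequant_row(&d, out);
    return out[40];    /* element 40 lies in block 1 */
}

float demo_v1(void) { return demo(0); }
float demo_v2(void) { return demo(1); }
```

demo_v1() returns 1.0f (code 2 times the row scale 0.5) and demo_v2() returns 2.0f (code 2 times block 1's scale 1.0). The zero-initialized descriptor lands on the old path without any producer changes, which is why auditing producers for zero-init matters.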
Update 2026-06-06. The math-core now owns the canonical two-ring KV decode (core/arm/ + core/forward/decode.c — the single-source qwen3_generate_kv/qwen3_ppl_decode). New since the snapshot below: dual-prime NTT keystore fusion (core/poly_ring/, write-once residue cache, exact <q,k> via residue dot + Garner — bit-exact to the scalar reference, gates T_PR_KSTORE/_BLUE/_RESDOT/_BATCH); the bit-packed popcount router (sp_arm_project_sig/select_sig + the sp_arm_scan_sig AVX512-overridable seam, gate T_ARM_SIG); GQA group-centroid kv-head selection; the batched forward-NTT seam (sp_ntt_fwd_batch + ntt_fwd_plan view). The Ring-2 abstract backend gained read_batch2 (mixed-stream concurrent fetch). All overlay knobs are off-by-default and bit-identical when off. Suite 22/22. Detail in the lattice papers/CONTRACT-C2/CONTRACT-SPEED.
GPU acceleration of this core’s packed arena (engine-side, 2026-06-06). The math-core’s Q8/Q4 packed-weight arena (core/frobenius/ + the codec) is now consumed directly on the GPU by a fused __dp4a GEMV in the engine’s CUDA decode: 1 byte/weight (Q8) or 0.5 byte/weight (Q4) straight from VRAM, with no f32 dequant. Isolated on an RTX 2060 at 12B-scale dims: f32 1× (bus-saturated, ~290 GB/s) → int8 ~3.8× → Q4 ~7.06×, all top-1-lossless vs the core’s dequant reference. This is the discrete-substrate payoff at deployment scale: the packed weights aren’t just smaller on disk, they’re ~7× faster to compute where the memory bus binds. See shannon-prime-system-engine README §5.2.1 + lattice SESSION-CLOSED-stage-beta-speed.md. GPU benchmarking discipline is in CONVENTIONS.md.
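For readers unfamiliar with __dp4a: the CUDA intrinsic takes two 32-bit words, treats each as four packed signed bytes, and adds their four-way dot product to a 32-bit accumulator. The sketch below emulates that in portable C and uses it for one GEMV row. It is an illustration, not the engine's kernel: the int8 activations, the single weight and activation scales, and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four signed bytes into one 32-bit word, lowest byte first. */
static uint32_t pack4(const int8_t *p) {
    return (uint32_t)(uint8_t)p[0]
         | (uint32_t)(uint8_t)p[1] << 8
         | (uint32_t)(uint8_t)p[2] << 16
         | (uint32_t)(uint8_t)p[3] << 24;
}

/* CPU emulation of a four-way int8 dot product with accumulate. */
static int32_t dp4a_emul(uint32_t a, uint32_t b, int32_t c) {
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(uint8_t)(a >> (8 * i));
        int8_t bi = (int8_t)(uint8_t)(b >> (8 * i));
        c += (int32_t)ai * (int32_t)bi;
    }
    return c;
}

/* One GEMV row: integer accumulation over packed bytes, scaled once at the end.
 * The weights are read as 1 byte each and never expanded to f32. */
static float gemv_row_q8(const int8_t *w, const int8_t *x, size_t n,
                         float w_scale, float x_scale) {
    assert(n % 4 == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4)
        acc = dp4a_emul(pack4(w + i), pack4(x + i), acc);
    return (float)acc * w_scale * x_scale;
}

float demo_gemv_row(void) {
    const int8_t w[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    const int8_t x[8] = {1, 1, 1, 1, 2, 2, 2, 2};
    return gemv_row_q8(w, x, 8, 0.5f, 0.25f);  /* integer sum is -6 */
}
```

demo_gemv_row() returns -0.75f (integer sum -6 times 0.5 times 0.25). When the memory bus is the limit, the byte counts alone bound the gain over 4-byte f32 at 4× for 1 byte/weight and 8× for 0.5 byte/weight; the measured ~3.8× and ~7.06× sit just under those ceilings.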
Sync discipline: this repo is ALSO carried as the engine’s lib/shannon-prime-system submodule, so the two checkouts can diverge. Both track the same origin/main: run git fetch and a behind-check before any build or commit (see CLAUDE.md), and follow every standalone commit with a submodule bump in the engine.
Headline (what the math-core now proves). The discrete forward is bit-exact on 5 arch families (through the 35B-A3B Gated-DeltaNet MoE); the reducing .sp-model codec is output-lossless and smaller than source (C1); the NTT-CRT / Frobenius / Spinor / KSTE primitives are all shipped + gated. These primitives feed the engine’s measured envelope — the two-ring memory (910× @32k, 7.57 µs/read off Optane) and the WIRE-CPU integer pipe (0.84 → 39.52 tok/s, 47×) are realized in the engine repo on top of this core. The open headline remains the Spinor per-vector KV codec ratio at bit-exact (lossy 29/31 today) — see shannon-prime-lattice/papers/PPT-LAT-STATE.md.