Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifsh2enlmd22d3edgiyoz3dxt2h5yhra6ghbxx2nrphg24ef6weoe",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnrfsfh67fl2"
  },
  "path": "/t/shannon-prime-lattice/176466#post_7",
  "publishedAt": "2026-06-08T08:00:40.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/nihilistau/Position_Is_Arithmetic",
    "https://github.com/nihilistau/Position_Is_Arithmetic/blob/main/GEMMA4-QUANT-FIX.md"
  ],
  "textContent": "**Update — the receipts-first paper series grew three papers, and one of them required indicting an ecosystem**\n\nA lot has happened since the opening post. The short version: the public,\nreceipts-first paper series at\n**https://github.com/nihilistau/Position_Is_Arithmetic** now carries papers\n**04, 05 and 06** — and finishing 06 forced us to root-cause something the\nwhole local-inference community is currently sitting on: every Gemma-4 GGUF\nwe could measure, including the post-fix rebuilds, carries broken weights.\n\nThe series discipline hasn’t changed: every number is a row in a shared\nledger with a command behind it, honest negatives stay on the record, and no\nthroughput number is citable without a quality gate on the same artifact.\nThat last rule is the reason this update exists.\n\n**Paper 04 — The Oracle & the Teacher** _(oracle-grounded backend verification)_\n\n  * _What it solves:_ porting a complex architecture to new silicon without the\nweeks-long divergence hunt — and, it turns out, defending yourself when the\nreference implementation itself is wrong.\n  * _How:_ extract a bit-faithful CPU oracle from the reference first (scalar,\nreadable, f64-accumulating), grade every backend against the oracle and\nnever against a prior port, and gate autoregressive decode by\nteacher-forcing (the oracle re-predicts the port’s own generated stream).\nReceipt: a 35-layer variable-geometry MatFormer (per-layer attention\nwidths, shared KV, proportional RoPE, softcap) matched its oracle at\n**max KL 2.663e-10** (argmax 12/12), both live runs green first-try, 38/38.\n  * _How it fits:_ this is the verification layer for everything in §1 of the\nopening post — “byte-exact, not small-KL” is only meaningful if the thing\nyou’re byte-exact _against_ is itself proven. 
The paper’s case study is the\nstrongest demonstration we have: when llama.cpp scored wikitext PPL 397–506\non Gemma-4-12B and the ecosystem normalized it, a from-scratch forward\nwritten off the official safetensors + config alone measured **4.6776** —\nthe model was healthy, llama.cpp’s forward was exonerated (two independent\nengines agree per-artifact), and the GGUF artifacts themselves were\nconvicted. An oracle is not a porting tool; it’s the only defense against a\npoisoned reference frame.\n\n\n\n**Paper 05 — The Probe Suite** _(bisection, isolation & benchmark hygiene as one set)_\n\n  * _What it solves:_ the fact that correct numbers about computing systems are\nnot read off — they are manufactured. The suite is how.\n  * _How:_ truncated-parity bisection, isolation sweeps, benchmark hygiene and\noracle-rank telemetry, used together. Documented kills: a 12.65× phantom\nspeedup (three stacked artifacts), a 2.8e-3 wrong-arithmetic localized in\ntwo probe runs, a mixed-precision 0/256 bug the isolated bench passed at\n1.34e-7, and a per-vector activation-quant collapse at oracle-rank 205,596\non outlier-heavy activations (fixed with per-block scales aligned to the\nkernel’s 128-bit loads).\n  * _How it fits:_ the second half turns the same toolset outward, at ecosystem\nscale — tensor-class swap bisection over the broken GGUFs (restoring just\nthe per-layer scale class recovered PPL 364→97; restoring norms made it\n_worse_, proving the matmul weights were damaged too), per-layer cosine\nforensics (no permutation; in-place damage with a period-6 layer\nsignature), and **simulate-before-build**: six quantization recipes\nsimulated through the proven reference forward before a line of CUDA\nexisted — and the built artifact then matched the simulation **to four\ndecimal places** (5.1259), with the GPU kernel agreeing as a third\ninstrument (5.1160).\n\n\n\n**Paper 06 — Computing on the Zip File** _(the dp4a bandwidth ladder — complete, gated, citable)_\n\n  * _What 
it solves:_ memory-bound decode on consumer silicon. The weights’\nbyte count is the speed of light, but only if you compute directly on the\npacked integer codes — dequantizing to f32 scratch first measured 3×\n_slower_ than plain f32.\n  * _How:_ warp-per-row `__dp4a` GEMV, 128-bit loads, in-ALU nibble unpack\n(~7% tax), exact integer accumulation, one Frobenius lift at the end —\nthe isolated ladder runs f32 1× → int8 ~3.8× → Q4 ~7.06×, hugging the\nbyte ratios. New this round: the **OK_Q4B** format (per-32-block f16\nscales, store-then-derive discipline) where one weight block is exactly\none 128-bit chunk in the kernel — zero extra code-bus traffic — and the\nsovereign quantization pipeline: artifact values come from the official\nsafetensors checkpoint, never from a GGUF, and every artifact gates\nagainst the paper-04 oracle before any throughput number is taken.\n  * _The headline, stated honestly:_ **Gemma-4-12B at 26.1 tok/s and wikitext\nPPL 5.12 on an RTX 2060 12GB** (graph path bit-exact, decode 256/256\ntop-1, 24/24 gates, clocks pinned). llama.cpp-CUDA on the same card does\n31.29 tok/s — at PPL 192–506, because its artifacts are broken.\nEngine-for-engine we move +18% more bytes/s (245 vs 207 GB/s effective);\nour artifact is heavier because it is the only mathematically intact\n4-bit Gemma-4-12B in existence. And in the spirit of the series: an\nearlier 34.2 tok/s headline is formally **retired** in the ledger —\nit was measured on an artifact that later failed the PPL gate. 
The rule\ncaught our own number first.\n\n\n\n**For anyone hitting the Gemma-4 quant weirdness themselves:** we published a\nstandalone walkthrough — verify the breakage in ~30 minutes with an\nengine-independent method, plus the quantization recipe that actually works\non this PTQ-hostile model (blanket 4-bit costs +45% PPL; 4-bit on the FFN\ngate/up pair only with 8-bit elsewhere costs +9.6%):\nhttps://github.com/nihilistau/Position_Is_Arithmetic/blob/main/GEMMA4-QUANT-FIX.md\nAll forensic instruments are MIT, ~130-line numpy/torch scripts, no GPU\nrequired for the verification.\n\nHow this fits the lattice overall: the opening post’s thesis was that\nfloating-point drift and un-provable identity are entropy bleeding into the\nhardware, and that a discrete substrate makes correctness a property you\nprove rather than estimate. This round extended that doctrine one level up\nthe stack — to the _artifacts_. The same discipline that makes a kernel\nbyte-exact (oracle, gates, receipts) is what caught an interchange format\nsilently destroying weights while every smoke test stayed green. The\nsupply chain is now part of the math.\n\nPapers, ledger, methodology, instruments:\n**nihilistau/Position_Is_Arithmetic**\n\nAs always — the unflattering numbers are kept attached on purpose.",
  "title": "Shannon Prime Lattice"
}