SVSK -Q quantization method

Hugging Face Forums [Unofficial] April 28, 2026
Hello! I’m a newbie here, but I want to show my latest project: SVSK (Structured Vector Sidecar).

NB! It seems I initially created this topic in the wrong section, so I’ll delete the previous one in “newbies”.

It’s a post‑training quantization method that keeps a strong 4‑bit base and adds a tiny low‑rank sidecar (rank 8/16) to recover the most harmful quantization error.

Key numbers (Qwen3‑4B, Wikitext validation):

| Variant | ΔNLL (↓ better) | PPL ratio |
|---------|-----------------|-----------|
| SVSK r16 (dense restore) | 0.028 | 1.028 |
| Q4_K_M (llama.cpp) | 0.032 | 1.033 |
| Q4_K_XL (llama.cpp) | 0.036 | 1.037 |

-> SVSK r16 shows ~15% less degradation (ΔNLL) than Q4_K_M in this test.
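
For anyone who wants to sanity-check the table, here is a small sketch of the arithmetic. It assumes ΔNLL is the increase in mean per-token negative log-likelihood in nats, so that the PPL ratio is `exp(ΔNLL)`; that assumption is consistent with the numbers in the table but is not stated in the post. The helper names `ppl_ratio` and `relative_reduction` are made up for this example. With the rounded table values the relative gap comes out at 12.5%, so the ~15% figure presumably reflects unrounded measurements.

```python
import math

# ΔNLL values copied from the table (assumed to be in nats per token,
# measured relative to the unquantized model).
delta_nll = {
    "SVSK r16 (dense restore)": 0.028,
    "Q4_K_M (llama.cpp)": 0.032,
    "Q4_K_XL (llama.cpp)": 0.036,
}

def ppl_ratio(d_nll: float) -> float:
    """Perplexity ratio implied by an increase in mean per-token NLL."""
    return math.exp(d_nll)

def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline

for name, d in delta_nll.items():
    print(f"{name}: PPL ratio ~ {ppl_ratio(d):.3f}")

gain = relative_reduction(
    delta_nll["SVSK r16 (dense restore)"],
    delta_nll["Q4_K_M (llama.cpp)"],
)
print(f"SVSK r16 vs Q4_K_M: {gain:.1%} lower ΔNLL (from rounded table values)")
```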

What makes it different?

  • Activation‑aware 4‑bit base (AA‑NativeQ4) – clips per channel.

  • Tile‑local low‑rank sidecar (U·V) stored in int8.

  • Total budget: 4.44 bpw (r8) / 4.6 bpw (r16) – not cheating with hidden 6‑bit.

  • No fine‑tuning, just calibration on 128 chunks of Wikitext.
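
To make the general idea concrete, here is a minimal NumPy sketch of "4‑bit base plus low‑rank int8 sidecar" on a single random matrix. It is not the SVSK implementation: it uses plain per-row symmetric int4 quantization and a truncated SVD of the weight residual, with no activation awareness, no calibration data and no tiling. All function names, the `clip` parameter and the 256×256 shape are assumptions chosen for illustration only.

```python
import numpy as np

def quantize_int4_per_channel(w, clip=1.0):
    """Symmetric 4-bit quantization with one scale per output channel (row).

    `clip` < 1.0 would shrink the range per channel; 1.0 means no clipping.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True) * clip
    scale = np.maximum(max_abs, 1e-12) / 7.0  # symmetric int4 range used here: [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization."""
    scale = max(np.abs(x).max(), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def low_rank_sidecar(residual, rank):
    """Rank-`rank` factors U, V of the residual via truncated SVD, each stored as int8."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    root = np.sqrt(s[:rank])              # split singular values evenly between U and V
    u_r = u[:, :rank] * root              # shape (m, rank)
    v_r = root[:, None] * vt[:rank]       # shape (rank, n)
    return quantize_int8(u_r), quantize_int8(v_r)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # one hypothetical weight tile

# 4-bit base reconstruction
q, scale = quantize_int4_per_channel(w)
w_base = q * scale

# Low-rank correction fitted to what the base got wrong
(u_q, u_s), (v_q, v_s) = low_rank_sidecar(w - w_base, rank=16)
w_hat = w_base + (u_q * u_s) @ (v_q * v_s)

err_base = np.linalg.norm(w - w_base)
err_side = np.linalg.norm(w - w_hat)
print(f"Frobenius error: base only {err_base:.3f}, base + sidecar {err_side:.3f}")
```

On random Gaussian weights the residual is close to noise, so the sidecar only recovers a modest share of the error here; the point of an activation-aware method is to aim that limited rank at the error that matters most for the model's outputs.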

Current status:

  • Offline quality better than Q4_K_M (on Qwen3‑4B).

  • Alpha runtime with Triton kernels – ~34 tok/s on RTX 4000.

  • No CUDA yet, not integrated into llama.cpp.

  • Not production‑ready.

What I need help with:

All I need is your feedback! I’d like to hear every opinion about it, whether you find it useful or useless. The answer is up to you :)

Full code, instructions and autotune script:

https://github.com/Dookoo2/SVSK

You can reproduce the PPL comparison in about 1 to 2 hours. I tried to write a good README with a step-by-step guide.

Thanks for reading! Any feedback is welcome.
