Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidggnosqu4cgxqq6hi4qkbrbuwn3mpi75ko5brcj5jkdp4funm3zu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkahnbeql3l2"
  },
  "path": "/t/gauss-llm-checkpoint-compression-via-gmm-ans/175525#post_1",
  "publishedAt": "2026-04-24T10:36:32.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - skystarry-ai/gauss: GAUSS: Distribution-Aware Compression for Neural Network Weights · GitHub",
    "GAUSS: Distribution-Aware Compression for Neural Network Weights"
  ],
  "textContent": "Sharing a project I’ve been working on: a compression scheme for LLM checkpoints based on GMM + ANS.\n\n**Motivation**\n\nMoving fine-tuned models around is expensive. A typical SFT checkpoint sits at 1.5–2 GB in safetensors format, and at GCS (inter-region) egress pricing ($0.12/GB), each download costs ~$0.20. At 1,000 downloads/month per model version that adds up fast — and optimizer states push a single training checkpoint past 5 GB. General-purpose compressors (gzip, zstd) barely work on float32 weights since they target byte-level patterns, and quantization methods change the computational graph. Gauss is designed for the archival/transfer case: compress once, recover the original dtype on demand.\n\n**What it does**\n\nTrained LLM weight tensors have strongly multi-modal value distributions — weights in a given layer cluster around a small number of modes. Gauss exploits this by fitting a K=16 Gaussian Mixture Model to each tensor, then entropy-coding the cluster assignments and residuals independently via ANS (Asymmetric Numeral Systems).\n\nThe pipeline per tensor:\n\n  1. Hard EM to fit GMM (subsampled to 200k elements for large tensors)\n  2. Assign every weight to its MAP cluster\n  3. Quantize the residual: `r = clip(round((w - μ) × S), -32767, 32767)`\n  4. ANS-encode two streams separately: index stream (Categorical) + residual stream (QuantizedGaussian)\n\n\n\nDecompression is exact: `ŵ = μ_cluster + r / S`, cast back to original dtype.\n\n**Results**\n\nOn a 24-layer SFT checkpoint (float32, 1645 MB):\n\n\n    1,645 MB → 335 MB   (4.90×)   max error ±5×10⁻⁴\n    Compression: 355s (2 workers, Colab CPU)\n    Decompression: 95s\n\n\nOn bf16 models, an adaptive scale cap kicks in (S drops from 1000 → 128) to avoid encoding sub-epsilon noise, giving ~3.3× in practice.\n\nPer-layer breakdown: o_proj compresses best (5.3–6.5×), embedding tables worst (4.09×). 
The spread is consistent with the interpretation that output projections have the most concentrated distributions.\n\n**Comparison**\n\nMethod | Ratio | Lossless? | Error\n---|---|---|---\ngzip / zstd | ~1.02–1.03× | Yes | –\nINT8 quantization | 4.00× | No | O(10⁻³)\nGauss | 4.90× | No | ±5×10⁻⁴\nINT4 quantization | 8.00× | No | O(10⁻²)\n\nFor bf16, the ±5×10⁻⁴ error bound is well within bf16’s own precision limit (±3.9×10⁻³), so functionally it doesn’t matter.\n\n**Limitations (honest)**\n\n  * Lossy: not suitable if you need bit-exact reproduction\n  * Tensors processed independently, no cross-layer structure exploited\n  * ANS encoding within each shard is sequential\n\n\n\n**A note on `gauss info` output**\n\nThe `info` command computes the original size from tensor shape metadata rather than reading the actual file bytes. This means the displayed ratio can differ from what `compress` reports, particularly for bf16/fp16 models, where the stored dtype is 2 bytes per element and miscounting the element size would inflate the ratio by 2×. If the numbers look surprisingly high in `info`, cross-check against the actual file sizes from `compress`.\n\n**Links**\n\n  * GitHub: skystarry-ai/gauss\n  * `pip install gauss-compress`\n  * Technical report: GAUSS: Distribution-Aware Compression for Neural Network Weights\n\n\n\nCurious if anyone has tried similar approaches or has thoughts on the adaptive K direction.",
  "title": "Gauss: LLM checkpoint compression via GMM + ANS"
}