Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig3fyeuxzntss45nwolwifnjfdvqelrrgtwdn5lrjszrmwu2sg4zm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlqp5npazzz2"
  },
  "path": "/t/i-measured-360-configs-quantization-often-costs-energy-below-the-crossover-point/175979#post_1",
  "publishedAt": "2026-05-13T15:22:54.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://hongping-zh.github.io/",
    "https://github.com/hongping-zh/ecocompute-ai",
    "https://doi.org/10.5281/zenodo.18900289",
    "https://hongping-zh.github.io/compare.html",
    "https://clawhub.ai/hongping-zh/ecocompute",
    "@misc"
  ],
  "textContent": "# I Measured 360+ Configurations — Quantization Often _Costs_ Energy Below the Crossover Point\n\n## TL;DR\n\nI ran NVML-instrumented inference benchmarks across **360+ configurations** on **four NVIDIA architectures** (T4 / RTX 4090D / RTX 5090 / A800, 1.1B-14B parameter range, FP16 / INT8 / NF4) and found two results that contradict common practice:\n\n  1. **`LLM.int8()` default** (mixed-precision decomposition) increases inference energy by **+80% to +162%** vs FP16 on RTX 4090D. Throughput drops 68-75%. Accuracy is essentially preserved (+0.7% to +2% PPL on WikiText-2).\n  2. **NF4 has an architecture-specific crossover point** in the **~3.2B – 5.2B** parameter range. _Below_ it, NF4 increases energy by 25-55%. _Above_ it, NF4 saves 15-23%. The crossover moves with the GPU, not the model.\n\n\n\nFull dataset (Zenodo DOI 10.5281/zenodo.18900289) and an interactive comparison tool: https://hongping-zh.github.io/\n\n* * *\n\n## Why I bothered measuring\n\nThe folk-wisdom pipeline `quantize → fewer bits → less memory → less energy` treats energy as a monotone function of bit-width. That’s a proxy, not a measurement. Mixed-precision kernels, dequantization overhead, and reduced arithmetic intensity all push back. 
I wanted wall-clock NVML numbers, not FLOP counts.\n\n* * *\n\n## Setup (no embellishment)\n\n  * **Hardware:** Tesla T4 (Turing, 70 W) · RTX 4090D (Ada) · RTX 5090 (Blackwell) · A800 (Ampere, 300 W)\n  * **Models:** 1.1B – 14B parameter range, including Qwen2-7B, Qwen2.5-3B, Yi-1.5-6B (full list in the dataset)\n  * **Quantization:** FP16 (baseline) · `LLM.int8()` default · INT8 pure (`llm_int8_threshold=0.0`) · NF4 (bitsandbytes)\n  * **Power sampling:** NVML at 10 Hz\n  * **Repetitions:** n = 10 per configuration, 3 warmup runs discarded, coefficient of variation < 3%\n  * **Accuracy:** WikiText-2 test split, cross-entropy → perplexity\n  * **Software:** PyTorch 2.4+, bitsandbytes, transformers\n  * **What I did NOT measure:** Apple Silicon, Jetson, GPTQ, AWQ, GGUF. Those are open invitations — see the bottom.\n\n\n\n* * *\n\n## Finding 1 — `LLM.int8()` default is an energy regression on Ada\n\nOn RTX 4090D, the default `LLM.int8()` path (with outlier mixed-precision decomposition) consistently uses **more** energy than the FP16 baseline:\n\nMetric vs FP16 on RTX 4090D | `LLM.int8()` default\n---|---\nEnergy / request | **+80% to +162%**\nThroughput | **−68% to −75%**\nPerplexity (WikiText-2) | +0.7% to +2% (essentially preserved)\n\nThe decomposition path keeps a small set of activations in FP16 and ping-pongs through extra kernels. You pay accuracy-grade overhead for roughly nothing in return — at least on this architecture.\n\nIf you set `llm_int8_threshold=0.0` to _disable_ the decomposition you do recover energy savings, but at a real accuracy cost: **+2.5% to +22.7% WikiText-2 PPL** across the tested models. Qwen2.5-3B degrades 15.3%, Yi-1.5-6B 22.7%. So “INT8” isn’t a single thing — it’s at least two operating points with very different energy/accuracy trade-offs.\n\n* * *\n\n## Finding 2 — NF4 crossover moves with the GPU, not the model\n\nNF4 4-bit quantization isn’t uniformly green. 
There is a parameter-count threshold below which NF4 _increases_ energy, and that threshold shifts with the GPU architecture:\n\nGPU architecture | NF4 crossover (params) | INT8 crossover (params)\n---|---|---\nTuring (T4) | ~3.2 B | ~4.0 B\nAmpere (A800) | ~3.7 B | ~4.3 B\nAda (RTX 4090D) | ~3.9 B | ~4.6 B\nBlackwell (RTX 5090) | ~5.2 B | ~5.6 B\n\nAggregated across the dataset:\n\n  * **Below the crossover:** quantization adds **25% to 55%** energy\n  * **Above the crossover:** quantization saves **15% to 23%** energy\n\n\n\nConcrete supplemental case from this April: Qwen2.5-3B on T4, NF4 **increased** energy by **+7.4% to +39.9%** across batch sizes 1 / 2 / 4. A 3B model on Turing sits just below the ~3.2 B crossover, and the data behaves as predicted.\n\nThe clean reading: **on newer architectures, the crossover moves up**, not down. The Blackwell memory subsystem makes FP16 cheaper relative to 4-bit dequant, which pushes the threshold from ~3.2 B (Turing) to ~5.2 B (Blackwell). Most of the small models people quantize today (1B–3B “edge-ready” Llama / Phi / Qwen) sit _below_ the crossover on every GPU I tested.\n\n* * *\n\n## What this means in practice\n\n  1. **“Quantize for energy” needs a parameter-count gate.** Below ~3-5 B on NVIDIA, FP16 is usually the greener choice.\n  2. **`LLM.int8()` ≠ INT8.** The default path on Ada is an energy regression. If you want INT8, decide explicitly between the decomposition path (preserves accuracy, pays energy) and `threshold=0.0` (saves energy, pays perplexity).\n  3. **Crossover thresholds are GPU-specific.** A config that wins on Ampere can lose on Blackwell. Re-measure when you swap fleet hardware.\n  4. 
**Latency is not energy.** Reporting tok/s in green-AI claims without wall-power data should be considered insufficient evidence.\n\n\n\n* * *\n\n## Methodology and reproducibility\n\n  * Raw CSVs, scripts, and per-row metadata: https://github.com/hongping-zh/ecocompute-ai\n  * Archived dataset (DOI): https://doi.org/10.5281/zenodo.18900289\n  * Interactive 2-4 model comparison tool: https://hongping-zh.github.io/compare.html\n  * ClawHub advisory skill (free, MIT) that quotes the matching benchmark row before any recommendation: https://clawhub.ai/hongping-zh/ecocompute\n\n\n\n\n    @misc{zhang2026llmenergy,\n      author = {Zhang, Hongping},\n      title  = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},\n      year   = {2026},\n      doi    = {10.5281/zenodo.18900289},\n      url    = {https://doi.org/10.5281/zenodo.18900289},\n      note   = {NVML power monitoring, 4 GPU architectures, 360+ configurations, includes perplexity data}\n    }\n\n\n* * *\n\n## Open questions I’d love community input on\n\n  * Does the crossover pattern hold for **GPTQ / AWQ / GGUF k-quants**? My data only covers bitsandbytes NF4 and INT8.\n  * Does it hold on **Apple Silicon / Jetson / AMD / Intel**? I have not measured non-NVIDIA hardware, and the unified-memory dequant story may be qualitatively different.\n  * For a fixed model size, how much of the crossover shift between Ampere → Ada → Blackwell is memory bandwidth vs tensor-core throughput? My data can’t separate the two cleanly.\n\n\n\nPRs with measurements on additional hardware are welcome at https://github.com/hongping-zh/ecocompute-ai. I’m particularly interested in H100, H200, MI300X, and Apple M3/M4 numbers.\n\n-- Hongping",
  "title": "I measured 360+ configs — quantization often costs energy below the crossover point"
}