I measured 360+ configs — quantization often costs energy below the crossover point

Hugging Face Forums [Unofficial] May 13, 2026
TL;DR

I ran NVML-instrumented inference benchmarks across 360+ configurations on four NVIDIA architectures (T4 / RTX 4090D / RTX 5090 / A800, 1.1B-14B parameter range, FP16 / INT8 / NF4) and found two results that contradict common practice:

  1. LLM.int8() default (mixed-precision decomposition) increases inference energy by +80% to +162% vs FP16 on RTX 4090D. Throughput drops 68-75%. Accuracy is essentially preserved (+0.7% to +2% PPL on WikiText-2).
  2. NF4 has an architecture-specific crossover point in the ~3.2B – 5.2B parameter range. Below it, NF4 increases energy by 25-55%. Above it, NF4 saves 15-23%. The crossover moves with the GPU, not the model.

Full dataset (Zenodo DOI 10.5281/zenodo.18900289) and an interactive comparison tool: https://hongping-zh.github.io/


Why I bothered measuring

The folk-wisdom pipeline "quantize → fewer bits → less memory → less energy" treats energy as a monotone function of bit-width. That’s a proxy, not a measurement. Mixed-precision kernels, dequantization overhead, and reduced arithmetic intensity all push back. I wanted NVML power readings integrated over wall-clock time, not FLOP counts.


Setup (no embellishment)

  • Hardware: Tesla T4 (Turing, 70 W) · RTX 4090D (Ada) · RTX 5090 (Blackwell) · A800 (Ampere, 300 W)
  • Models: 1.1B – 14B parameter range, including Qwen2-7B, Qwen2.5-3B, Yi-1.5-6B (full list in the dataset)
  • Quantization: FP16 (baseline) · LLM.int8() default · INT8 pure (llm_int8_threshold=0.0) · NF4 (bitsandbytes)
  • Power sampling: NVML at 10 Hz
  • Repetitions: n = 10 per configuration, 3 warmup runs discarded, coefficient of variation < 3%
  • Accuracy: WikiText-2 test split, cross-entropy → perplexity
  • Software: PyTorch 2.4+, bitsandbytes, transformers
  • What I did NOT measure: Apple Silicon, Jetson, GPTQ, AWQ, GGUF. Those are open invitations — see the bottom.
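
For readers who want to reproduce the power-sampling step, here is a minimal sketch of turning NVML samples into energy per run. It assumes the `pynvml` bindings and trapezoidal integration. The function names (`energy_joules`, `coefficient_of_variation`, `sample_power`) are illustrative and are not taken from the benchmark code. In a real harness the sampler would run in a background thread while inference executes.

```python
import statistics
import time


def energy_joules(timestamps_s, power_w):
    """Trapezoidal integral of power (W) over time (s), giving energy in joules."""
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two matching (time, power) samples")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (e.g. across repeated runs)."""
    return statistics.stdev(values) / statistics.mean(values)


def sample_power(duration_s, hz=10.0, gpu_index=0):
    """Poll NVML board power for duration_s seconds. Requires an NVIDIA GPU."""
    import pynvml  # imported lazily so the helpers above work without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        times, watts = [], []
        start = time.perf_counter()
        while (now := time.perf_counter() - start) < duration_s:
            times.append(now)
            # nvmlDeviceGetPowerUsage reports milliwatts; convert to watts
            watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(1.0 / hz)
        return times, watts
    finally:
        pynvml.nvmlShutdown()
```

Keeping the integration and the variation check as pure functions means they can be tested on recorded traces without a GPU attached.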

Finding 1 — LLM.int8() default is an energy regression on Ada

On RTX 4090D, the default LLM.int8() path (with outlier mixed-precision decomposition) consistently uses more energy than the FP16 baseline:

| Metric (vs FP16 on RTX 4090D) | LLM.int8() default |
|---|---|
| Energy / request | +80% to +162% |
| Throughput | −68% to −75% |
| Perplexity (WikiText-2) | +0.7% to +2% (essentially preserved) |

The decomposition path keeps a small set of outlier activations in FP16 and ping-pongs through extra kernels. You pay the overhead of preserving accuracy and get no energy or throughput benefit in return, at least on this architecture.

If you set llm_int8_threshold=0.0 to disable the decomposition you do recover energy savings, but at a real accuracy cost: +2.5% to +22.7% WikiText-2 PPL across the tested models. Qwen2.5-3B degrades 15.3%, Yi-1.5-6B 22.7%. So “INT8” isn’t a single thing — it’s at least two operating points with very different energy/accuracy trade-offs.
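
The operating points discussed above can be written down as `BitsAndBytesConfig` arguments. This is a sketch of how they might be expressed with transformers, not the exact benchmark configuration. The mode names and the `quant_kwargs` / `load_model` helpers are made up for illustration, and `device_map="auto"` (which needs `accelerate`) is an assumption. Any NF4 options beyond the quant type are left at library defaults because the post does not state them.

```python
def quant_kwargs(mode):
    """Keyword arguments for transformers.BitsAndBytesConfig, or None for FP16."""
    table = {
        # FP16 baseline: no quantization config at all
        "fp16": None,
        # LLM.int8() default: outlier decomposition stays on (library default threshold)
        "int8_default": {"load_in_8bit": True},
        # "Pure" INT8: threshold 0.0 disables the mixed-precision decomposition
        "int8_pure": {"load_in_8bit": True, "llm_int8_threshold": 0.0},
        # NF4 4-bit weights via bitsandbytes
        "nf4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]


def load_model(model_id, mode):
    """Load a causal LM in the requested mode. Needs a GPU plus bitsandbytes."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = quant_kwargs(mode)
    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # assumption: single-GPU placement handled by accelerate
        quantization_config=quant_config,
    )
```

Naming the two INT8 paths separately in benchmark code makes it harder to report one and mean the other.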


Finding 2 — NF4 crossover moves with the GPU, not the model

NF4 4-bit quantization isn’t uniformly green. There is a parameter-count threshold below which NF4 increases energy, and that threshold shifts with the GPU architecture:

| GPU architecture | NF4 crossover (params) | INT8 crossover (params) |
|---|---|---|
| Turing (T4) | ~3.2 B | ~4.0 B |
| Ampere (A800) | ~3.7 B | ~4.3 B |
| Ada (RTX 4090D) | ~3.9 B | ~4.6 B |
| Blackwell (RTX 5090) | ~5.2 B | ~5.6 B |

Aggregated across the dataset:

  • Below the crossover: quantization adds 25% to 55% energy
  • Above the crossover: quantization saves 15% to 23% energy

Concrete supplemental case from this April: Qwen2.5-3B on T4, NF4 increased energy by +7.4% to +39.9% across batch sizes 1 / 2 / 4. 3B-on-Turing sits just below the ~3.2 B crossover, and the direction of the effect matches the prediction.

The clean reading: on newer architectures, the crossover moves up, not down. The Blackwell memory subsystem makes FP16 cheaper relative to 4-bit dequant, which pushes the threshold from ~3.2 B (Turing) to ~5.2 B (Blackwell). Most of the small models people quantize today (1B–3B “edge-ready” Llama / Phi / Qwen) sit below the crossover on every GPU I tested.
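
If you want to locate a crossover on your own hardware, one simple approach is to measure the relative energy change at several model sizes and interpolate where it changes sign. The sketch below uses linear interpolation between neighbouring sizes. That is an assumption for illustration and not necessarily how the thresholds in the table were derived, and both function names are hypothetical.

```python
def relative_energy_change(e_quant_j, e_fp16_j):
    """Fractional change in energy vs the FP16 baseline (positive = costs more)."""
    return (e_quant_j - e_fp16_j) / e_fp16_j


def estimate_crossover(points):
    """Estimate the model size where quantization stops costing energy.

    points: iterable of (params_in_billions, relative_energy_change).
    Returns the linearly interpolated size at which the change goes from
    positive to zero or negative, or None if no such sign change is observed.
    """
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y0 > 0 >= y1:
            # Zero of the straight line through (x0, y0) and (x1, y1)
            return x0 + (x1 - x0) * y0 / (y0 - y1)
    return None
```

With only a handful of model sizes per GPU the estimate is coarse, so it is best read as a bracket between two measured sizes rather than a precise threshold.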


What this means in practice

  1. “Quantize for energy” needs a parameter-count gate. Below ~3-5 B on NVIDIA, FP16 is usually the greener choice.
  2. LLM.int8() ≠ INT8. The default path on Ada is an energy regression. If you want INT8, decide explicitly between the decomposition path (preserves accuracy, pays energy) and threshold=0.0 (saves energy, pays perplexity).
  3. Crossover thresholds are GPU-specific. A config that wins on Ampere can lose on Blackwell. Re-measure when you swap fleet hardware.
  4. Latency is not energy. Green-AI claims that report tok/s without measured power data should be treated as insufficient evidence.
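
A parameter-count gate like the one in point 1 could be sketched as a lookup against the crossover table from Finding 2. The thresholds below are copied from that table. The function name, the dictionary keys, and the optional safety margin are illustrative assumptions. The table does not say which INT8 path its INT8 column refers to, so treat that column with care.

```python
# Approximate crossover points in billions of parameters, from the Finding 2 table.
CROSSOVER_B = {
    "T4": {"nf4": 3.2, "int8": 4.0},
    "A800": {"nf4": 3.7, "int8": 4.3},
    "RTX 4090D": {"nf4": 3.9, "int8": 4.6},
    "RTX 5090": {"nf4": 5.2, "int8": 5.6},
}


def quantization_likely_saves_energy(gpu, scheme, params_b, margin_b=0.0):
    """True if the model is above the measured crossover for this GPU and scheme.

    margin_b is an optional safety margin (billions of parameters) for models
    that sit close to the threshold, where the sign of the effect is uncertain.
    """
    try:
        threshold = CROSSOVER_B[gpu][scheme]
    except KeyError:
        # No measurement for this combination: refuse to guess
        raise KeyError(f"no measured crossover for {scheme!r} on {gpu!r}") from None
    return params_b > threshold + margin_b
```

Raising on unknown hardware rather than falling back to a default keeps the gate honest about point 3: thresholds measured on one GPU do not transfer to another.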

Methodology and reproducibility


Open questions I’d love community input on

  • Does the crossover pattern hold for GPTQ / AWQ / GGUF k-quants? My data only covers bitsandbytes NF4 and INT8.
  • Does it hold on Apple Silicon / Jetson / AMD / Intel? I have not measured non-NVIDIA hardware, and the unified-memory dequant story may be qualitatively different.
  • For a fixed model size, how much of the crossover shift between Ampere → Ada → Blackwell is memory bandwidth vs tensor-core throughput? My data can’t separate the two cleanly.

PRs with measurements on additional hardware are welcome at https://github.com/hongping-zh/ecocompute-ai. I’m particularly interested in H100, H200, MI300X, and Apple M3/M4 numbers.

-- Hongping
