External Publication

I measured 360+ configs — quantization often costs energy below the crossover point

Hugging Face Forums [Unofficial] May 13, 2026

TL;DR

I ran NVML-instrumented inference benchmarks across 360+ configurations on four NVIDIA architectures (T4 / RTX 4090D / RTX 5090 / A800, 1.1B-14B parameter range, FP16 / INT8 / NF4) and found two results that contradict common practice:

LLM.int8() default (mixed-precision decomposition) increases inference energy by +80% to +162% vs FP16 on RTX 4090D. Throughput drops 68-75%. Accuracy is essentially preserved (+0.7% to +2% PPL on WikiText-2).
NF4 has an architecture-specific crossover point in the ~3.2B – 5.2B parameter range. Below it, NF4 increases energy by 25-55%. Above it, NF4 saves 15-23%. The crossover moves with the GPU, not the model.

Full dataset (Zenodo DOI 10.5281/zenodo.18900289) and an interactive comparison tool: https://hongping-zh.github.io/

Why I bothered measuring

The folk-wisdom pipeline (quantize → fewer bits → less memory → less energy) treats energy as a monotone function of bit-width. That’s a proxy, not a measurement. Mixed-precision kernels, dequantization overhead, and reduced arithmetic intensity all push back. I wanted measured NVML power numbers, not FLOP counts.

Setup (no embellishment)

Hardware: Tesla T4 (Turing, 70 W) · RTX 4090D (Ada) · RTX 5090 (Blackwell) · A800 (Ampere, 300 W)
Models: 1.1B – 14B parameter range, including Qwen2-7B, Qwen2.5-3B, Yi-1.5-6B (full list in the dataset)
Quantization: FP16 (baseline) · LLM.int8() default · INT8 pure (llm_int8_threshold=0.0) · NF4 (bitsandbytes)
Power sampling: NVML at 10 Hz
Repetitions: n = 10 per configuration, 3 warmup runs discarded, coefficient of variation < 3%
Accuracy: WikiText-2 test split, cross-entropy → perplexity
Software: PyTorch 2.4+, bitsandbytes, transformers
What I did NOT measure: Apple Silicon, Jetson, GPTQ, AWQ, GGUF. Those are open invitations — see the bottom.
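
For readers who want to reproduce the measurement loop, here is a minimal sketch of what "NVML at 10 Hz, 3 warmups, n = 10, CV < 3%" could look like in code. It is not the script from the repository: the helper names (`integrate_energy_joules`, `measure_energy`, `benchmark`, `nvml_power_reader`) are hypothetical, and trapezoidal integration of the power samples is an assumption about how energy is derived from them. The only NVML calls used are `nvmlInit`, `nvmlDeviceGetHandleByIndex`, and `nvmlDeviceGetPowerUsage` from `pynvml`.

```python
import statistics
import threading
import time


def integrate_energy_joules(samples):
    """Trapezoidal integral of (timestamp_s, power_w) samples, in joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def nvml_power_reader(gpu_index=0):
    """Return a zero-argument callable that reports board power in watts."""
    import pynvml  # lazy import so the helpers above work without a GPU

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # nvmlDeviceGetPowerUsage reports milliwatts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0


def measure_energy(workload, read_power_w, hz=10.0):
    """Sample power in a background thread while workload() runs."""
    stop = threading.Event()
    samples = [(time.perf_counter(), read_power_w())]  # sample at start

    def sampler():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(1.0 / hz):
            samples.append((time.perf_counter(), read_power_w()))

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        workload()
    finally:
        stop.set()
        thread.join()
    samples.append((time.perf_counter(), read_power_w()))  # sample at end
    return integrate_energy_joules(samples)


def benchmark(workload, read_power_w, warmup=3, runs=10):
    """Discard warmup runs, then return (mean energy in J, CV) over runs."""
    for _ in range(warmup):
        workload()
    energies = [measure_energy(workload, read_power_w) for _ in range(runs)]
    return statistics.mean(energies), coefficient_of_variation(energies)
```

On a machine with an NVIDIA GPU, `benchmark(run_one_request, nvml_power_reader(0))` would return the mean energy per request and its coefficient of variation, where `run_one_request` is whatever callable performs one inference request. Note that NVML reports GPU board power, so CPU and DRAM energy are outside this measurement.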

Finding 1 — `LLM.int8()` default is an energy regression on Ada

On RTX 4090D, the default LLM.int8() path (with outlier mixed-precision decomposition) consistently uses more energy than the FP16 baseline:

Metric vs FP16 on RTX 4090D	`LLM.int8()` default
Energy / request	+80% to +162%
Throughput	−68% to −75%
Perplexity (WikiText-2)	+0.7% to +2% (essentially preserved)
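
For clarity on how the numbers in this table are defined: each percentage is a relative change against the FP16 baseline, and perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch follows; the function names are hypothetical, and it assumes the cross-entropy is summed in nats over all evaluated tokens.

```python
import math


def perplexity(total_nll_nats, n_tokens):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(total_nll_nats / n_tokens)


def relative_change_pct(quantized, baseline):
    """Signed percentage change of a quantized metric vs. its FP16 baseline."""
    return 100.0 * (quantized - baseline) / baseline
```

As an illustration with made-up numbers, a request that costs 262 J quantized against 100 J in FP16 gives `relative_change_pct(262.0, 100.0)`, which is the +162% end of the range above.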

The decomposition path keeps a small set of outlier activations in FP16 and ping-pongs through extra kernels. You pay the full overhead of the accuracy-preserving path and get no energy savings in return, at least on this architecture.

If you set llm_int8_threshold=0.0 to disable the decomposition you do recover energy savings, but at a real accuracy cost: +2.5% to +22.7% WikiText-2 PPL across the tested models. Qwen2.5-3B degrades 15.3%, Yi-1.5-6B 22.7%. So “INT8” isn’t a single thing — it’s at least two operating points with very different energy/accuracy trade-offs.
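
To make the operating points concrete, here is a sketch of how they could be selected through `transformers.BitsAndBytesConfig`. The mode names and the `load_model` helper are hypothetical, and the exact loading arguments used for the benchmark are not stated in this post, so treat `torch_dtype=torch.float16` and `device_map="auto"` as assumptions. The NF4 entry leaves the compute dtype at the library default for the same reason.

```python
# Keyword arguments for BitsAndBytesConfig, keyed by a hypothetical mode name.
QUANT_KWARGS = {
    "fp16": None,  # baseline: no quantization config
    # LLM.int8() default: the outlier threshold is left at the library default,
    # so the mixed-precision decomposition stays active.
    "int8_default": {"load_in_8bit": True},
    # "Pure" INT8: threshold 0.0 disables the decomposition.
    "int8_pure": {"load_in_8bit": True, "llm_int8_threshold": 0.0},
    "nf4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
}


def quantization_kwargs(mode):
    """Return a copy of the BitsAndBytesConfig kwargs for a mode, or None."""
    if mode not in QUANT_KWARGS:
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs = QUANT_KWARGS[mode]
    return dict(kwargs) if kwargs is not None else None


def load_model(model_id, mode):
    """Load a causal LM in the requested mode (needs torch, transformers,
    accelerate, and bitsandbytes installed)."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = quantization_kwargs(mode)
    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        quantization_config=quant_config,
        device_map="auto",
    )
```

The point of keeping the two INT8 entries separate is that they only differ by one argument, yet they land on opposite sides of the energy/accuracy trade-off described above.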

Finding 2 — NF4 crossover moves with the GPU, not the model

NF4 4-bit quantization isn’t uniformly green. There is a parameter-count threshold below which NF4 increases energy, and that threshold shifts with the GPU architecture:

GPU architecture	NF4 crossover (params)	INT8 crossover (params)
Turing (T4)	~3.2 B	~4.0 B
Ampere (A800)	~3.7 B	~4.3 B
Ada (RTX 4090D)	~3.9 B	~4.6 B
Blackwell (RTX 5090)	~5.2 B	~5.6 B

Aggregated across the dataset:

Below the crossover: quantization adds 25% to 55% energy
Above the crossover: quantization saves 15% to 23% energy

Concrete supplemental case from this April: Qwen2.5-3B on T4, NF4 increased energy by +7.4% to +39.9% across batch sizes 1 / 2 / 4. 3B-on-Turing sits below the ~3.2 B crossover, and the data behaves as predicted.
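
One way to turn per-model energy deltas into a crossover estimate is linear interpolation between the two neighbouring model sizes where the delta changes sign. The sketch below shows that idea only; it is not necessarily how the thresholds in the table were derived, the function name is hypothetical, and the numbers in the usage note are made up for illustration.

```python
def estimate_crossover(points):
    """Estimate where the energy delta vs. FP16 crosses zero.

    points: iterable of (params_billion, energy_delta_pct) pairs, where a
    positive delta means the quantized run used more energy than FP16.
    Returns the interpolated parameter count in billions, or None if the
    delta never goes from positive to non-positive.
    """
    ordered = sorted(points)
    for (x0, d0), (x1, d1) in zip(ordered, ordered[1:]):
        if d0 > 0 >= d1:
            # Linear interpolation to the zero crossing between x0 and x1.
            return x0 + (x1 - x0) * d0 / (d0 - d1)
    return None
```

For example, with illustrative deltas of +40% at 1 B, +10% at 3 B, and -10% at 7 B, `estimate_crossover([(1.0, 40.0), (3.0, 10.0), (7.0, -10.0)])` lands halfway between 3 B and 7 B. With only a few model sizes per GPU, an estimate like this is coarse, which is consistent with the "~" on every threshold.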

The clean reading: on newer architectures, the crossover moves up, not down. The Blackwell memory subsystem makes FP16 cheaper relative to 4-bit dequant, which pushes the threshold from ~3.2 B (Turing) to ~5.2 B (Blackwell). Most of the small models people quantize today (1B–3B “edge-ready” Llama / Phi / Qwen) sit below the crossover on every GPU I tested.

What this means in practice

“Quantize for energy” needs a parameter-count gate. Below ~3-5 B on NVIDIA, FP16 is usually the greener choice.
LLM.int8() ≠ INT8. The default path on Ada is an energy regression. If you want INT8, decide explicitly between the decomposition path (preserves accuracy, pays energy) and threshold=0.0 (saves energy, pays perplexity).
Crossover thresholds are GPU-specific. A config that wins on Ampere can lose on Blackwell. Re-measure when you swap fleet hardware.
Latency is not energy. Reporting tok/s in green-AI claims without measured power data is insufficient evidence.
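
The parameter-count gate from the first point can be sketched as a small lookup. The thresholds below are copied from the crossover table in Finding 2; the function name and the architecture keys are hypothetical, and because the thresholds are approximate, a model close to the line should be re-measured instead of trusted to this lookup.

```python
# Approximate crossover thresholds in billions of parameters (from Finding 2).
NF4_CROSSOVER_B = {"turing": 3.2, "ampere": 3.7, "ada": 3.9, "blackwell": 5.2}
INT8_CROSSOVER_B = {"turing": 4.0, "ampere": 4.3, "ada": 4.6, "blackwell": 5.6}


def recommend_precision(arch, params_billion, scheme="nf4"):
    """Return the scheme if the model is above the crossover, else 'fp16'."""
    table = {"nf4": NF4_CROSSOVER_B, "int8": INT8_CROSSOVER_B}[scheme]
    threshold = table[arch.lower()]
    return scheme if params_billion > threshold else "fp16"
```

With this table, a 7 B model on Ampere comes back as `"nf4"`, while a 3 B model on Turing or a 5 B model on Blackwell comes back as `"fp16"`. The gate only covers energy; memory limits may still force quantization on small cards.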

Methodology and reproducibility

Raw CSVs, scripts, and per-row metadata: https://github.com/hongping-zh/ecocompute-ai
Archived dataset (DOI): https://doi.org/10.5281/zenodo.18900289
Interactive 2-4 model comparison tool: https://hongping-zh.github.io/compare.html
ClawHub advisory skill (free, MIT) that quotes the matching benchmark row before any recommendation: https://clawhub.ai/hongping-zh/ecocompute

@misc{zhang2026llmenergy,
  author = {Zhang, Hongping},
  title  = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year   = {2026},
  doi    = {10.5281/zenodo.18900289},
  url    = {https://doi.org/10.5281/zenodo.18900289},
  note   = {NVML power monitoring, 4 GPU architectures, 360+ configurations, includes perplexity data}
}

Open questions I’d love community input on

Does the crossover pattern hold for GPTQ / AWQ / GGUF k-quants? My data only covers bitsandbytes NF4 and INT8.
Does it hold on Apple Silicon / Jetson / AMD / Intel? I have not measured non-NVIDIA hardware, and the unified-memory dequant story may be qualitatively different.
For a fixed model size, how much of the crossover shift between Ampere → Ada → Blackwell is memory bandwidth vs tensor-core throughput? My data can’t separate the two cleanly.

PRs with measurements on additional hardware are welcome at https://github.com/hongping-zh/ecocompute-ai. I’m particularly interested in H100, H200, MI300X, and Apple M3/M4 numbers.

-- Hongping