I Measured 360+ Configurations — Quantization Often Costs Energy Below the Crossover Point
TL;DR
I ran NVML-instrumented inference benchmarks across 360+ configurations on four NVIDIA architectures (T4 / RTX 4090D / RTX 5090 / A800, 1.1B-14B parameter range, FP16 / INT8 / NF4) and found two results that contradict common practice:
- LLM.int8() default (mixed-precision decomposition) increases inference energy by +80% to +162% vs FP16 on RTX 4090D. Throughput drops 68-75%. Accuracy is essentially preserved (+0.7% to +2% PPL on WikiText-2).
- NF4 has an architecture-specific crossover point in the ~3.2B to 5.2B parameter range. Below it, NF4 increases energy by 25-55%. Above it, NF4 saves 15-23%. The crossover moves with the GPU, not the model.
Full dataset (Zenodo DOI 10.5281/zenodo.18900289) and an interactive comparison tool: https://hongping-zh.github.io/
Why I bothered measuring
The folk-wisdom pipeline quantize → fewer bits → less memory → less energy treats energy as a monotone function of bit-width. That’s a proxy, not a measurement. Mixed-precision kernels, dequantization overhead, and reduced arithmetic intensity all push back. I wanted wall-clock NVML numbers, not FLOP counts.
Setup (no embellishment)
- Hardware: Tesla T4 (Turing, 70 W) · RTX 4090D (Ada) · RTX 5090 (Blackwell) · A800 (Ampere, 300 W)
- Models: 1.1B – 14B parameter range, including Qwen2-7B, Qwen2.5-3B, Yi-1.5-6B (full list in the dataset)
- Quantization: FP16 (baseline) · LLM.int8() default · INT8 pure (llm_int8_threshold=0.0) · NF4 (bitsandbytes)
- Power sampling: NVML at 10 Hz
- Repetitions: n = 10 per configuration, 3 warmup runs discarded, coefficient of variation < 3%
- Accuracy: WikiText-2 test split, cross-entropy → perplexity
- Software: PyTorch 2.4+, bitsandbytes, transformers
- What I did NOT measure: Apple Silicon, Jetson, GPTQ, AWQ, GGUF. Those are open invitations — see the bottom.
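The power-sampling step in the setup list can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual script. It assumes the `pynvml` bindings (where `nvmlDeviceGetPowerUsage` reports milliwatts), a background thread polling at about 10 Hz, and trapezoidal integration of power over wall-clock time. The names `energy_joules`, `make_nvml_reader`, and `measure` are hypothetical.

```python
import threading
import time


def energy_joules(samples):
    """Trapezoidal integral of (timestamp_s, watts) samples, in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def make_nvml_reader(index=0):
    """Return a zero-argument function that reads board power in watts.

    Assumes pynvml is installed; imported lazily so the helpers above and
    below can be exercised on a machine without a GPU.
    """
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W


def measure(workload, read_watts, hz=10.0):
    """Run workload() while polling read_watts() at ~hz; return (joules, seconds)."""
    samples = []
    stop = threading.Event()

    def poll():
        while not stop.is_set():
            samples.append((time.perf_counter(), read_watts()))
            stop.wait(1.0 / hz)
        # Closing sample so the last partial interval is included.
        samples.append((time.perf_counter(), read_watts()))

    start = time.perf_counter()
    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    try:
        workload()
    finally:
        stop.set()
        thread.join()
    elapsed = time.perf_counter() - start
    return energy_joules(samples), elapsed
```

On a GPU machine, something like `measure(run_one_request, make_nvml_reader(0))` would return the energy and duration of one run, where `run_one_request` is whatever callable performs the inference (a placeholder here). Injecting the reader keeps the integration logic testable without hardware.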
Finding 1 — LLM.int8() default is an energy regression on Ada
On RTX 4090D, the default LLM.int8() path (with outlier mixed-precision decomposition) consistently uses more energy than the FP16 baseline:
| Metric vs FP16 on RTX 4090D | LLM.int8() default |
|---|---|
| Energy / request | +80% to +162% |
| Throughput | −68% to −75% |
| Perplexity (WikiText-2) | +0.7% to +2% (essentially preserved) |
The decomposition path keeps a small set of outlier activations in FP16 and ping-pongs through extra kernels. You pay the overhead of preserving accuracy and get no energy benefit in return, at least on this architecture.
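For readers reproducing the table, here is a small sketch of the arithmetic behind its three rows. It assumes the percentages are signed relative changes against the FP16 run of the same model, and that perplexity is the exponential of the mean per-token cross-entropy in nats. The function names are hypothetical and are not taken from the dataset's scripts.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pct_change(value, baseline):
    """Signed relative change vs. the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def energy_per_request(total_joules, n_requests):
    """Average energy per request over a measured run."""
    return total_joules / n_requests
```

Under this convention, a quantized run that uses 2.62 times the FP16 energy per request corresponds to the +162% end of the range, and a run at one quarter of the FP16 throughput corresponds to the −75% end.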
If you set llm_int8_threshold=0.0 to disable the decomposition you do recover energy savings, but at a real accuracy cost: +2.5% to +22.7% WikiText-2 PPL across the tested models. Qwen2.5-3B degrades 15.3%, Yi-1.5-6B 22.7%. So “INT8” isn’t a single thing — it’s at least two operating points with very different energy/accuracy trade-offs.
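The operating points discussed above map onto `BitsAndBytesConfig` arguments in transformers roughly as sketched below. Treat this as an assumption-laden illustration: the exact settings used in the benchmark (for example the NF4 compute dtype, or whether double quantization was enabled) are not stated in this post, and `QUANT_KWARGS` and `load_model` are hypothetical names.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, one entry per operating point.
QUANT_KWARGS = {
    "fp16": None,  # baseline: no quantization config
    # Library default llm_int8_threshold is 6.0, so outlier decomposition stays on.
    "int8_default": {"load_in_8bit": True},
    # Threshold 0.0 disables the mixed-precision decomposition ("INT8 pure").
    "int8_pure": {"load_in_8bit": True, "llm_int8_threshold": 0.0},
    "nf4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
}


def load_model(model_id, mode):
    """Load model_id under one operating point.

    Needs torch, transformers, accelerate, bitsandbytes, and a CUDA GPU;
    imports are lazy so the table above can be inspected without them.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = QUANT_KWARGS[mode]
    config = None
    if kwargs is not None:
        extra = {}
        if kwargs.get("load_in_4bit"):
            # Assumption: FP16 compute for the 4-bit path.
            extra["bnb_4bit_compute_dtype"] = torch.float16
        config = BitsAndBytesConfig(**kwargs, **extra)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        torch_dtype=torch.float16,
        device_map="auto",
    )
```

Keeping the two INT8 variants as separately named entries makes it harder to report "INT8" results without saying which path was measured.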
Finding 2 — NF4 crossover moves with the GPU, not the model
NF4 4-bit quantization isn’t uniformly green. There is a parameter-count threshold below which NF4 increases energy, and that threshold shifts with the GPU architecture:
| GPU architecture | NF4 crossover (params) | INT8 crossover (params) |
|---|---|---|
| Turing (T4) | ~3.2 B | ~4.0 B |
| Ampere (A800) | ~3.7 B | ~4.3 B |
| Ada (RTX 4090D) | ~3.9 B | ~4.6 B |
| Blackwell (RTX 5090) | ~5.2 B | ~5.6 B |
Aggregated across the dataset:
- Below the crossover: quantization adds 25% to 55% energy
- Above the crossover: quantization saves 15% to 23% energy
Concrete supplemental case from this April: Qwen2.5-3B on T4, where NF4 increased energy by +7.4% to +39.9% across batch sizes 1 / 2 / 4. A 3B model on Turing sits below the ~3.2 B crossover, and the data behaves as predicted.
The clean reading: on newer architectures, the crossover moves up, not down. One plausible explanation is that the Blackwell memory subsystem makes FP16 cheaper relative to 4-bit dequant, which would push the threshold from ~3.2 B (Turing) to ~5.2 B (Blackwell). Most of the small models people quantize today (1B–3B “edge-ready” Llama / Phi / Qwen) sit below the crossover on every GPU I tested.
What this means in practice
- “Quantize for energy” needs a parameter-count gate. Below ~3-5 B on NVIDIA, FP16 is usually the greener choice.
- LLM.int8() ≠ INT8. The default path on Ada is an energy regression. If you want INT8, decide explicitly between the decomposition path (preserves accuracy, pays energy) and threshold=0.0 (saves energy, pays perplexity).
- Crossover thresholds are GPU-specific. A config that wins on Ampere can lose on Blackwell. Re-measure when you swap fleet hardware.
- Latency is not energy. Reporting tok/s in green-AI claims without wall-power data should be considered insufficient evidence.
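The parameter-count gate from the first bullet could look something like the sketch below. The crossover values are copied from the Finding 2 table. Everything else is an illustrative assumption: the function name `recommend`, the `margin_b` band around the crossover, and the choice to answer "measure" for unlisted architectures are not part of the dataset or its tooling.

```python
# Crossover estimates in billions of parameters, from the Finding 2 table.
CROSSOVER_B = {
    "turing": {"nf4": 3.2, "int8": 4.0},
    "ampere": {"nf4": 3.7, "int8": 4.3},
    "ada": {"nf4": 3.9, "int8": 4.6},
    "blackwell": {"nf4": 5.2, "int8": 5.6},
}


def recommend(arch, params_b, scheme="nf4", margin_b=0.3):
    """Return 'fp16', the scheme name, or 'measure' for an energy-first choice.

    margin_b is a hypothetical uncertainty band: models within it of the
    crossover are flagged for measurement instead of being decided by lookup.
    """
    table = CROSSOVER_B.get(arch.lower())
    if table is None or scheme not in table:
        return "measure"  # no data for this GPU or scheme
    crossover = table[scheme]
    if abs(params_b - crossover) < margin_b:
        return "measure"
    return scheme if params_b > crossover else "fp16"
```

With these values, a 7B model on Ampere maps to NF4, a 3B model on Blackwell maps to FP16, and a 3B model on Turing falls inside the margin and is flagged for measurement. The gate only encodes the energy side; the accuracy trade-off from Finding 1 still has to be weighed separately.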
Methodology and reproducibility
Raw CSVs, scripts, and per-row metadata: https://github.com/hongping-zh/ecocompute-ai
Archived dataset (DOI): https://doi.org/10.5281/zenodo.18900289
Interactive 2-4 model comparison tool: https://hongping-zh.github.io/compare.html
ClawHub advisory skill (free, MIT) that quotes the matching benchmark row before any recommendation: https://clawhub.ai/hongping-zh/ecocompute
@misc{zhang2026llmenergy,
  author = {Zhang, Hongping},
  title  = {LLM Energy Benchmark: Real GPU Power Measurements for Quantized Inference},
  year   = {2026},
  doi    = {10.5281/zenodo.18900289},
  url    = {https://doi.org/10.5281/zenodo.18900289},
  note   = {NVML power monitoring, 4 GPU architectures, 360+ configurations, includes perplexity data}
}
Open questions I’d love community input on
- Does the crossover pattern hold for GPTQ / AWQ / GGUF k-quants? My data only covers bitsandbytes NF4 and INT8.
- Does it hold on Apple Silicon / Jetson / AMD / Intel? I have not measured non-NVIDIA hardware, and the unified-memory dequant story may be qualitatively different.
- For a fixed model size, how much of the crossover shift between Ampere → Ada → Blackwell is memory bandwidth vs tensor-core throughput? My data can’t separate the two cleanly.
PRs with measurements on additional hardware are welcome at https://github.com/hongping-zh/ecocompute-ai. I’m particularly interested in H100, H200, MI300X, and Apple M3/M4 numbers.
-- Hongping