Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiada5mu5egiulb3f3bvkvshac3ea7lqx2vm2bf2afaida2j25s5ie",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjhi6rovlzf2"
  },
  "path": "/t/current-state-and-future-of-integer-only-llm-inference-non-floating-point/175216#post_2",
  "publishedAt": "2026-04-14T12:42:14.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "vLLM",
    "GitHub",
    "arXiv",
    "OpenReview",
    "NVIDIA Docs",
    "Open Compute Project"
  ],
  "textContent": "For now, here’s a rundown of the major options available:\n\nOf the options below, I’ve actually tried BitNet myself in the past. It’s very compact, fast, and produces decent results. However, and this isn’t limited to BitNet, since floating-point calculations are simpler and faster when relying on GPU hardware support, native integer calculations aren’t widely adopted in mainstream frameworks. That said, it might be possible to fine-tune a model using a major framework and then transfer it to an integer-based framework. BitNet also has Llama.\n\n* * *\n\nHere is the current picture as of April 2026.\n\n**Integer-only LLM inference is real, but it is not the mainstream default yet.** Most popular “quantized LLM inference” today is still one of these instead: **weight-only quantization**, **mixed weight/activation quantization**, or **low-precision floating-point serving** such as FP8 or FP4-family formats. The hard part is not just quantizing linear layers. It is also keeping **normalization, softmax, scaling, and dynamic range management** inside a low-bit path without losing too much accuracy or giving the speedup back through dequantization overhead. That is exactly the gap papers like **I-BERT** and **I-LLM** were written to close. (Hugging Face)\n\n## Why this is harder than it looks\n\nA lot of discussions mix up three different things.\n\nFirst, there is **quantized storage**: the checkpoint is saved in INT4, INT8, GGUF, or another compact format. Second, there is **quantized compute for the heavy GEMMs**: the large matrix multiplications use low-bit kernels, but some other operators still run in FP16/BF16/FP32. Third, there is **true integer-only inference**: the whole forward graph, including rescaling, normalization, and softmax-like pieces, stays in integer arithmetic with no float fallback. Most mainstream stacks are strong on the first two. Very few are broadly production-ready for the third across arbitrary LLM checkpoints. 
(Hugging Face)\n\nThat distinction matters because **GGUF is a file format, not a guarantee about arithmetic semantics**. Hugging Face describes GGUF as a single-file format used to store models for inference with GGML-family executors. That makes it valuable for deployment and portability, but it does not by itself imply “the whole graph runs in integers.” (Hugging Face)\n\n## 1. Current trends: what actually exists right now\n\n### Mainstream libraries\n\nIn official Hugging Face `transformers`, the documented quantization paths are **AWQ**, **GPTQ**, and **8-bit / 4-bit bitsandbytes** support. Those are useful and widely adopted, but the docs do **not** present a general-purpose full integer-only LLM path. They present a practical low-bit deployment stack. (Hugging Face)\n\nIn core `vLLM`, the stable docs expose **INT8 W8A8** and **INT4 W4A16** paths, and the broader `LLM Compressor` docs focus on **FP8, INT8, INT4, NVFP4, and MXFP4**. That tells you where the center of gravity is today: mixed low-precision serving, not universal integer-only execution. There is an important backend-specific exception: **`vllm-ascend` now advertises W4A4 support**, which shows that full low-bit activation paths are getting closer to deployment on specific hardware stacks, but that is still not the same as a universal integer-only mode in mainstream `vLLM`. (vLLM)\n\nSo the direct answer to your first question is:\n\n**For arbitrary Hugging Face checkpoints, there is not yet a broadly production-ready, first-class integer-only path in mainstream `transformers` or core `vLLM`.** What exists today is mostly **hybrid**, **backend-specific**, or **architecture-specific**. (Hugging Face)\n\n### Architecture-specific and kernel-specific paths\n\nThe clearest real example today is **BitNet + `bitnet.cpp`**. 
Microsoft’s open **BitNet b1.58 2B4T** model is described as a native **1-bit** LLM with **W1.58A8** inference, and both the model card and the Hugging Face docs explicitly warn that standard `transformers` execution does **not** contain the specialized kernels needed to realize the architecture’s real efficiency benefits. In other words, BitNet is important not only because it is low-bit, but because it shows that **model + runtime co-design** matters. (Hugging Face)\n\nA second important project is **T-MAC**. Its repo and paper are highly relevant to your use case because they focus on **mixed-precision matrix multiplication without dequantization**, using lookup tables to avoid the usual “dequantize to higher precision, then compute” penalty. The paper frames the problem almost exactly the way you did: low-bit models often still depend on indirect dequantize-heavy execution, and that overhead is especially painful on CPUs and edge devices. (GitHub)\n\n### Research closest to your definition\n\nIf your definition is strict (_the entire forward graph stays in integers_), then **I-LLM** is one of the most directly relevant papers. Its core claim is that previous LLM PTQ methods still needed floating-point work for quantize/dequantize and nonlinear operators like **RMSNorm** and **Softmax**, and that a proper integer-only solution needs **integer-only matmul**, **integer-only softmax/exponent approximations**, and **integer-only normalization**. That is much closer to what you are asking about than GPTQ, AWQ, or most GGUF deployments. The caveat is that I-LLM is still best understood as **research with strong systems direction**, not yet a standard, turnkey mode in the dominant serving libraries. (arXiv)\n\n## 2. Accuracy vs efficiency: how bad is the drop?\n\n### W8A8: already practical\n\nFor **W8A8**, the reference point is still **SmoothQuant**. 
It showed that 8-bit weights plus 8-bit activations can be made accurate enough for large LLMs by smoothing activation outliers, and reported **up to 1.56× speedup** and **2× memory reduction** with **negligible loss in accuracy**. That is one reason W8A8 moved into real systems faster than stricter integer-only schemes: it solves much of the memory and throughput problem without forcing every awkward operator into a pure integer formulation. (GitHub)\n\n`vLLM`’s stable **INT8 W8A8** documentation is another sign that W8A8 has crossed from “interesting paper result” into “supported infrastructure,” at least on the right hardware. (vLLM)\n\n### Why activations are the hard part\n\nA major reason integer-only is harder than weight-only quantization is **activation outliers**. Hugging Face’s `optimum-quanto` docs say per-tensor activation quantization to INT8 can cause serious errors when tensors contain large outliers, often collapsing most values to zero except the outliers. That is precisely why techniques like SmoothQuant exist. It is also why ordinary “just lower the bit-width” thinking breaks down once you try to quantize the whole graph. (GitHub)\n\n### W4A4: much better than before, but still not boring\n\nThere is older evidence showing that **plain W4A4 is hard for decoder-only language models**. A widely cited 2023 study found that W4A4 caused a **significant accuracy drop for decoder-only models**, even though it worked much better for encoder-only and encoder-decoder architectures. That paper is still important because it explains why naive “just do everything in 4 bits” failed for autoregressive LLMs for a while. (arXiv)\n\nThen the next wave of work changed the picture:\n\n  * **I-LLM** reports **W4A4 with negligible loss of accuracy** by carefully redesigning the integer-only path. 
(arXiv)\n  * **QuaRot** reports end-to-end 4-bit quantization of **weights, activations, and KV cache**, with **at most 0.47 WikiText-2 perplexity loss** on LLaMA-2-70B and **99% of zero-shot performance** retained. (arXiv)\n  * **SpinQuant** says learned rotations reduce the full-precision gap on zero-shot reasoning to **2.9 points** on LLaMA-2-7B under 4-bit weights, activations, and KV cache. (arXiv)\n  * **FlatQuant** reports **less than 1% accuracy drop** for **W4A4 on LLaMA-3-70B**, while also claiming strong efficiency gains from fusing its transformations. (OpenReview)\n  * **COMET** argues that practical **W4A4KV4** serving is possible with a mixed-precision activation strategy and optimized **W4Ax** kernels. (arXiv)\n\nSo the right summary is:\n\n**W8A8 is already practical. W4A4 is now credible. But W4A4 is still far more fragile, transformation-dependent, and kernel-dependent than W8A8.** (arXiv)\n\n### Why papers and practice still diverge\n\nA key systems lesson is that low-bit math is not enough by itself. **QServe** points this out very clearly: it says state-of-the-art INT4 methods can lose **20–90% runtime** to dequantizing either weights or partial sums on GPUs. That is a major reason many “4-bit” systems do not deliver the speedups people expect. If the runtime keeps reconstructing higher precision internally, the theoretical gain shrinks fast. (arXiv)\n\nThat is also why your focus on **true integer-only** is well-placed. The real question is not only “can I compress the checkpoint?” It is “does the execution path stay low-bit all the way through?” (arXiv)\n\n## 3. Future outlook: integer-only vs FP8 / MX / FP4\n\nI do **not** think one format wins everywhere.\n\n### Datacenter direction\n\nFor the latest datacenter GPUs, the trend is clearly toward **FP8** and **FP4-family microscaled formats**, not toward strict integer-only arithmetic as the universal standard. 
NVIDIA’s TensorRT docs are explicit: **INT4 block quantization supports weight-only quantization**, while **FP4 block quantization supports both weights and activations**. Their architecture docs also say INT4 is used for weight-only quantization and requires dequantization before compute. That is a strong signal about where server inference is going. (NVIDIA Docs)\n\nThe `vLLM` `LLM Compressor` docs point the same way. For **Hopper**, they recommend **W8A8-FP8**. For **Blackwell**, they recommend **NVFP4 or MXFP4** for maximum compression, with FP8 as a balance point. That is not an anti-integer statement. It is a hardware reality statement: on the newest server GPUs, low-precision floating-point formats increasingly line up best with the available fast kernels. (vLLM)\n\nThe **OCP MX** standard matters here too. The spec was published in 2023 and defines interoperable microscaling formats designed to improve energy efficiency across datacenter and endpoint AI. The fact that the industry aligned around a standard family that includes **MXFP8, MXFP6, MXFP4, and MXINT8** is another sign that the ecosystem wants **microscaled low-precision formats**, not only classical integer quantization. (Open Compute Project)\n\n### Edge and CPU/NPU direction\n\nThis is where your thesis looks strongest.\n\nOn edge-class devices, CPUs, older NPUs, and DSP-like accelerators, the value of integer-only or near-integer-native execution is much higher. There, avoiding dequantization and keeping kernels simple can matter more than following the newest FP8/FP4 tensor-core path. That is exactly the space **T-MAC** targets, and it is also why **BitNet** is interesting: both projects are effectively betting that **native low-bit arithmetic plus specialized kernels** matters a lot more off the bleeding edge of server GPUs. (GitHub)\n\nEven `vLLM`’s own scheme guide hints at this split. 
In the same guidance that recommends FP8 or FP4-family formats for Hopper and Blackwell, it recommends **W4AINT8** on Arm. That is exactly the sort of compromise that makes sense on edge and client hardware: very low-bit weights, integer activations, and hardware fit over theoretical purity. (vLLM)\n\nSo my forecast is:\n\n  * **Cloud / datacenter**: FP8 and FP4-family microscaled formats likely become the mainstream low-precision standard. (NVIDIA Docs)\n  * **Edge / CPU / NPU / DSP-style deployments**: integer-only or near-integer-native inference remains strategically important, especially when paired with model/runtime co-design. (GitHub)\n\n## 4. What is actually production-ready today?\n\nFor **general open-weight LLM deployment**, the production-ready pieces today are mostly:\n\n  * `transformers` quantization backends such as AWQ, GPTQ, and bitsandbytes 4/8-bit, (Hugging Face)\n  * `vLLM` stable **INT8 W8A8** and **INT4 W4A16** paths, plus `LLM Compressor` for broader mixed-precision schemes, (vLLM)\n  * backend-specific stacks like TensorRT, ONNX Runtime, and OpenVINO, which all support useful low-bit modes, but which are still mostly **weight-only** or **mixed precision** rather than “integer-only everywhere.” (NVIDIA Docs)\n\nFor **true or near-true integer-centric inference**, the most credible production-ish systems today are **architecture-specific** and **kernel-specific**, not universal:\n\n  * **BitNet + `bitnet.cpp`** for native 1-bit-style models, (GitHub)\n  * **T-MAC** for dequantization-free low-bit GEMM on CPU/NPU-style targets, (GitHub)\n  * hardware-specific offshoots like **`vllm-ascend` W4A4**, which are meaningful signs of progress but not yet a general answer for all models and all runtimes. (vLLM)\n\n## 5. What I would watch closely\n\nIf you want a practical watch list, I would follow these:\n\n**Research and algorithms**\n\n  * **I-LLM** for the clearest research definition of integer-only LLM inference. 
(arXiv)\n  * **SmoothQuant** for the most deployable W8A8 baseline. (GitHub)\n  * **QuaRot**, **SpinQuant**, **FlatQuant**, and **COMET** for the current W4A4 / W4A4KV4 frontier. (arXiv)\n\n**Systems and kernels**\n\n  * **BitNet / `bitnet.cpp`** for native low-bit model/runtime co-design. (GitHub)\n  * **T-MAC** for dequantization-free low-bit CPU/NPU kernels. (GitHub)\n  * **QServe** for the systems reality check that dequantization overhead can erase much of the theoretical INT4 win. (arXiv)\n\n**Mainstream deployment stacks**\n\n  * **`vLLM` quantization + `LLM Compressor` docs** for what is actually landing in deployable serving software, (vLLM)\n  * **TensorRT**, **ONNX Runtime**, and **OpenVINO** quantization docs to see what production backends really support today and where “INT4” still means weight-only rather than full integer-only execution. (NVIDIA Docs)\n\n## 6. How I would answer your three questions directly\n\n### 1. Are there production-ready frameworks or architectures for fully integer-only inference right now?\n\nFor arbitrary LLM checkpoints in mainstream libraries, **not really**. The official docs in `transformers` and core `vLLM` emphasize useful quantization paths, but not a general integer-only mode. The most promising real systems today are either **architecture-specific** like **BitNet**, or **kernel-specific** like **T-MAC**, or **backend-specific** like certain W4A4 paths in `vllm-ascend`. (Hugging Face)\n\n### 2. How significant is the accuracy degradation for W8A8 or W4A4 integer-only paths?\n\nFor **W8A8**, the answer is often “small enough to deploy,” especially with SmoothQuant-style handling of outliers. For **W4A4**, the answer is now “sometimes surprisingly small, but still much more fragile.” The best recent papers show strong results, but W4A4 is still much more dependent on careful transformations and optimized kernels than W8A8. (arXiv)\n\n### 3. 
Will integer-only become the mainstream standard?\n\nMy view is **no, not everywhere**. In datacenters, the likely mainstream is **FP8 plus FP4-family microscaled formats** because that is what current server hardware accelerates well. On edge and integer-centric hardware, though, integer-only or near-integer-native inference still looks strategically important. So I expect a **split future**, not one universal winner. (NVIDIA Docs)\n\n## 7. One practical recommendation\n\nIf your immediate goal is **deploying existing open LLMs today**, I would target **W8A8** or a strong **weight-only INT4 / mixed-precision** path first. Those are where the ecosystem is stable enough to be productive now. If your real goal is **edge research** or **integer-centric deployment**, then I would spend most of my attention on **I-LLM**, **BitNet**, and **T-MAC**, because that is where the field is actually trying to solve the “no float fallback” problem instead of only shrinking checkpoints. (arXiv)\n\nThe cleanest conclusion is this:\n\n**Integer-only LLM inference is no longer a fringe idea, but in 2026 it is still mostly a specialized systems direction rather than the mainstream default.** For general-purpose serving, the world is still dominated by hybrid quantization and, on the newest GPUs, increasingly by FP8 and microscaled FP4-family formats. For edge hardware and integer-centric accelerators, though, integer-only or near-integer-native inference still looks like one of the most promising long-term bets. (vLLM)",
  "title": "Current State and Future of \"Integer-Only\" LLM Inference (Non-Floating Point)"
}