{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiautx2re5bibn6hqwjwexg457xek4xggv4t55dhum5symtp33v5hu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh25h4mplhp2"
},
"path": "/t/kv-cache-precision-compatibility-in-spatial-disaggregation-prefill-decode-setups-with-awq-gptq-models/174261#post_1",
"publishedAt": "2026-03-14T11:59:52.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi everyone, we are setting up a distributed inference pipeline using a prefill-decode disaggregation topology to mitigate the memory I/O bottleneck on the local node. Prefill runs on a remote high-compute node, and decoding runs on a local edge node. If we deploy an INT4-quantized model (e.g., AWQ or GPTQ) on the edge node, does the incoming KV cache from the remote prefill node strictly need to be quantized into the same format before transmission? Or can the quantized attention layers on the decode node natively accept an FP16 KV cache tensor transferred via RPC without significant overhead? Any insights on managing this quantization mismatch in split inference would be highly appreciated.",
"title": "KV Cache precision compatibility in Spatial Disaggregation (Prefill-Decode) setups with AWQ/GPTQ models"
}