{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig2cczujuqgsu55h2hbjlngyjtxkmjr5zqod7h752amf7mrkj6qsm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhowgje25622"
},
"path": "/t/s2lc-100-lora-adapters-in-3-59ms-by-reconstructing-weights-in-gpu-registers-never-writing-to-hbm/174532#post_1",
"publishedAt": "2026-03-22T17:43:58.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"code re",
"https://github.com/QQQTech/S2LC"
],
"textContent": "Code repo: https://github.com/QQQTech/S2LC\n\nS2LC (Shared Spectral Low-Rank Compression) exploits the spectral structure shared by neural network modules derived from the same base model. A shared basis matrix V_common (shape D×R, FP16) is computed once per layer via truncated SVD across the module population; each module’s unique contribution U_k (shape D×R) is projected onto V_common and encoded in two compact codebooks at approximately 3 bits per element. At inference, the fused Triton kernel computes y = x × V_common × U_kᵀ, reconstructing U_k values directly in the GPU register file during the tiled GEMM. It produces no intermediate HBM writes; the only write is the final output tensor. CUDA Graph capture eliminates CPU-side kernel launch overhead.\n\nResults: 10.1× memory compression over standard LoRA, 3.59 ms forward-pass latency for K=100 concurrent adapters, and zero intermediate HBM writes (verified with NVIDIA Nsight Compute). Extensions to MoE expert compression, KV cache compression, and variable-depth serving are described in Sections 5–7 and are currently theoretical: the algorithm is specified but not yet benchmarked.",
"title": "S2LC – 100 LoRA adapters in 3.59ms by reconstructing weights in GPU registers, never writing to HBM"
}