Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig2cczujuqgsu55h2hbjlngyjtxkmjr5zqod7h752amf7mrkj6qsm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhowgje25622"
  },
  "path": "/t/s2lc-100-lora-adapters-in-3-59ms-by-reconstructing-weights-in-gpu-registers-never-writing-to-hbm/174532#post_1",
  "publishedAt": "2026-03-22T17:43:58.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "code re",
    "https://github.com/QQQTech/S2LC"
  ],
  "textContent": "code repo: https://github.com/QQQTech/S2LC\n\nS2LC (Shared Spectral Low-Rank Compression) exploits shared spectral structure across neural network modules derived from the same base model. A shared basis matrix V_common (shape D×R, FP16) is computed once per layer via truncated SVD across the module population; each module’s unique contribution U_k (shape D×R) is projected onto V_common and encoded in two compact codebooks at approximately 3 bits per element. At inference, the fused Triton kernel computes y = x × V_common × U_kᵀ by reconstructing U_k values directly in the GPU register file during the tiled GEMM, producing no intermediate HBM writes; the only write is the final output tensor. CUDA Graph capture eliminates CPU-side kernel launch overhead. Results: 10.1× memory compression over standard LoRA, 3.59 ms forward-pass latency for K=100 concurrent adapters, zero intermediate HBM writes verified by NVIDIA Nsight Compute. Extensions to MoE expert compression, KV cache compression, and variable-depth serving are described in Sections 5–7 and are currently theoretical — the algorithm is specified but not yet benchmarked.",
  "title": "S2LC – 100 LoRA adapters in 3.59ms by reconstructing weights in GPU registers, never writing to HBM"
}
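
The factorization described in textContent, y = x × V_common × U_kᵀ, can be illustrated with a minimal NumPy sketch. It is not the S2LC implementation. The function names (shared_basis, project, forward, make_demo), the choice to stack the per-adapter weight deltas side by side before the truncated SVD, and the projection U_k = ΔW_kᵀ × V_common are all assumptions made for illustration. The two-codebook encoding at roughly 3 bits per element, FP16 storage, the fused Triton kernel, and CUDA Graph capture are omitted.

```python
import numpy as np

def shared_basis(deltas, rank):
    """Assumed construction: top-`rank` left singular vectors of the
    horizontally stacked adapter deltas, used as V_common (D x R)."""
    stacked = np.concatenate(deltas, axis=1)            # (D, K*D)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]

def project(delta, v_common):
    """Assumed projection: per-adapter factor U_k (D x R) such that
    V_common @ U_k.T is delta projected onto the span of V_common."""
    return delta.T @ v_common

def forward(x, v_common, u_k):
    """y = x @ V_common @ U_k^T, computed through the rank-R bottleneck."""
    return (x @ v_common) @ u_k.T

def make_demo(d=32, r=4, k=5, seed=0):
    """Hypothetical toy data: K deltas that share one rank-R column space."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))    # orthonormal D x R
    deltas = [q @ rng.standard_normal((r, d)) for _ in range(k)]
    x = rng.standard_normal((3, d))                     # batch of 3 inputs
    return deltas, x

deltas, x = make_demo()
v_common = shared_basis(deltas, rank=4)
factors = [project(dw, v_common) for dw in deltas]      # one U_k per adapter
outputs = [forward(x, v_common, u_k) for u_k in factors]
```

In this toy setup the adapters share an exact rank-4 column space, so each output matches x @ ΔW_k up to floating-point error. Real adapters would only approximately share a basis, and the reconstruction error would then depend on R and on the quantization step that this sketch leaves out.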