Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiafph7mukkhhczmhvqjl6bhj2taghp6lil4tir7l2m7svup5legnm",
    "uri": "at://did:plc:3fychdutjjusoqeq24ljch6q/app.bsky.feed.post/3mnwgvp5ssto2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiflo6xt7is6b2iafwghkjahlgggocme5jwjsbeuqqwcywuvjhmszm"
    },
    "mimeType": "image/png",
    "size": 24783
  },
  "path": "/abs/2606.10944v1",
  "publishedAt": "2026-06-10T00:00:00.000Z",
  "site": "https://arxiv.org",
  "tags": [
    "Albert Gong",
    "Annabelle Michael Carrell",
    "Raaz Dwivedi",
    "Lester Mackey"
  ],
  "textContent": "**Authors:** Albert Gong, Annabelle Michael Carrell, Raaz Dwivedi, Lester Mackey\n\nWe introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \\log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.",
  "title": "Express Language Modeling"
}