{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiafph7mukkhhczmhvqjl6bhj2taghp6lil4tir7l2m7svup5legnm",
"uri": "at://did:plc:3fychdutjjusoqeq24ljch6q/app.bsky.feed.post/3mnwgvp5ssto2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiflo6xt7is6b2iafwghkjahlgggocme5jwjsbeuqqwcywuvjhmszm"
},
"mimeType": "image/png",
"size": 24783
},
"path": "/abs/2606.10944v1",
"publishedAt": "2026-06-10T00:00:00.000Z",
"site": "https://arxiv.org",
"tags": [
"Albert Gong",
"Annabelle Michael Carrell",
"Raaz Dwivedi",
"Lester Mackey"
],
"textContent": "**Authors:** Albert Gong, Annabelle Michael Carrell, Raaz Dwivedi, Lester Mackey\n\nWe introduce a new tool, Express, for converting a non-causal attention approximation into a causal one with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \\log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.",
"title": "Express Language Modeling"
}