{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiat2i255dlhmyoh5gdkpgcgl642sqzb74y46jekdi5yixl7yantv4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlbnyc3brau2"
},
"path": "/t/tempus-a-resource-invariant-gemm-framework-for-versal-ai-edge-607-gops-on-16-cores-open-source-c/175826#post_1",
"publishedAt": "2026-05-07T14:44:22.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"https://github.com/mgrailoo/TEMPUS",
"https://arxiv.org/abs/2605.00536"
],
"textContent": "**TL;DR:** We built a GEMM framework that achieves **607 GOPS** on AMD Versal AI Edge using only **16 AIE-ML cores** , without scaling hardware resources. The complete C++/HLS code is open-source.\n\n**The Problem:** Most SOTA GEMM frameworks scale by adding more cores (spatial scaling). This fails on resource-limited edge SoCs due to routing congestion and bandwidth saturation.\n\n**Our Solution (Tempus):**\n\n * **Temporal scaling** instead of spatial: fixed 16-core compute block.\n\n * **Algorithmic data tiling & replication** on Programmable Logic.\n\n * **Deadlock-free DATAFLOW** with II=1 cascade streaming.\n\n\n\n\n**Results (on Versal AI Edge):**\n\n * **607 GOPS** at 10.7W total on-chip power.\n\n * **22x core frugality** vs. spatial SOTA (ARIES).\n\n * **211x higher platform-aware utility (PAU)**.\n\n * Zero URAM/DSP utilization.\n\n\n\n\n**Repository:** https://github.com/mgrailoo/TEMPUS\n**Paper:** https://arxiv.org/abs/2605.00536\n\nThe repo includes end-to-end flows from PyTorch comparison to hardware deployment. We hope this provides a sustainable foundation for edge LLM inference on Versal.\n\nHappy to answer any questions about the implementation, tiling schemes, or performance metrics!",
"title": "Tempus: A Resource-Invariant GEMM Framework for Versal AI Edge (607 GOPS on 16 cores + open-source C++)"
}