{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiat2i255dlhmyoh5gdkpgcgl642sqzb74y46jekdi5yixl7yantv4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlbnyc3brau2"
  },
  "path": "/t/tempus-a-resource-invariant-gemm-framework-for-versal-ai-edge-607-gops-on-16-cores-open-source-c/175826#post_1",
  "publishedAt": "2026-05-07T14:44:22.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/mgrailoo/TEMPUS",
    "https://arxiv.org/abs/2605.00536"
  ],
  "textContent": "**TL;DR:** We built a GEMM framework that achieves **607 GOPS** on AMD Versal AI Edge using only **16 AIE-ML cores** , without scaling hardware resources. The complete C++/HLS code is open-source.\n\n**The Problem:** Most SOTA GEMM frameworks scale by adding more cores (spatial scaling). This fails on resource-limited edge SoCs due to routing congestion and bandwidth saturation.\n\n**Our Solution (Tempus):**\n\n  * **Temporal scaling** instead of spatial: fixed 16-core compute block.\n\n  * **Algorithmic data tiling & replication** on Programmable Logic.\n\n  * **Deadlock-free DATAFLOW** with II=1 cascade streaming.\n\n\n\n\n**Results (on Versal AI Edge):**\n\n  * **607 GOPS** at 10.7W total on-chip power.\n\n  * **22x core frugality** vs. spatial SOTA (ARIES).\n\n  * **211x higher platform-aware utility (PAU)**.\n\n  * Zero URAM/DSP utilization.\n\n\n\n\n**Repository:** https://github.com/mgrailoo/TEMPUS\n**Paper:** https://arxiv.org/abs/2605.00536\n\nThe repo includes end-to-end flows from PyTorch comparison to hardware deployment. We hope this provides a sustainable foundation for edge LLM inference on Versal.\n\nHappy to answer any questions about the implementation, tiling schemes, or performance metrics!",
  "title": "Tempus: A Resource-Invariant GEMM Framework for Versal AI Edge (607 GOPS on 16 cores + open-source C++)"
}