Tempus: A Resource-Invariant GEMM Framework for Versal AI Edge (607 GOPS on 16 cores + open-source C++)
TL;DR: We built a GEMM framework that achieves 607 GOPS on AMD Versal AI Edge using only 16 AIE-ML cores, without scaling hardware resources. The complete C++/HLS code is open source.
The Problem: Most SOTA GEMM frameworks scale by adding more cores (spatial scaling). This fails on resource-limited edge SoCs due to routing congestion and bandwidth saturation.
Our Solution (Tempus):
Temporal scaling instead of spatial: fixed 16-core compute block.
Algorithmic data tiling & replication on Programmable Logic.
Deadlock-free DATAFLOW with II=1 cascade streaming.
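To make the temporal-scaling idea concrete, here is a minimal software sketch in plain C++. It is not the Tempus implementation. A fixed block of 16 logical "cores" is reused pass after pass over output tiles and K-chunks, so a larger problem costs more passes rather than more cores. The 4x4 core arrangement, the sub-tile size `kSub`, the K-chunk `kDepth`, and all function names are assumptions made for illustration. Only the 16-core count comes from the post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::int32_t>;  // row-major storage

constexpr std::size_t kCores = 16;            // fixed compute block (from the post)
constexpr std::size_t kGrid = 4;              // assumed 4x4 logical arrangement
constexpr std::size_t kSub = 2;               // assumed sub-tile edge owned by one core
constexpr std::size_t kBlock = kGrid * kSub;  // output tile edge covered per pass
constexpr std::size_t kDepth = 4;             // assumed K-chunk consumed per pass
static_assert(kGrid * kGrid == kCores, "grid must cover all cores");

// One pass of the fixed block: C[tile] += A[tile rows, K-chunk] * B[K-chunk, tile cols].
// Each logical core handles its own kSub x kSub sub-tile; edges are clipped to M and N.
void block_pass(const Matrix& a, const Matrix& b, Matrix& c,
                std::size_t M, std::size_t N, std::size_t K,
                std::size_t m0, std::size_t n0, std::size_t k0) {
    const std::size_t k_end = std::min(k0 + kDepth, K);
    for (std::size_t core = 0; core < kCores; ++core) {
        const std::size_t row0 = m0 + (core / kGrid) * kSub;
        const std::size_t col0 = n0 + (core % kGrid) * kSub;
        for (std::size_t i = row0; i < std::min(row0 + kSub, M); ++i) {
            for (std::size_t j = col0; j < std::min(col0 + kSub, N); ++j) {
                std::int32_t acc = 0;
                for (std::size_t k = k0; k < k_end; ++k) {
                    acc += a[i * K + k] * b[k * N + j];
                }
                c[i * N + j] += acc;  // partial sums accumulate across K-chunks
            }
        }
    }
}

// Temporal scaling: the same 16-core block is scheduled over every
// (row tile, column tile, K-chunk) triple. Returns the number of passes.
std::size_t temporal_gemm(const Matrix& a, const Matrix& b, Matrix& c,
                          std::size_t M, std::size_t N, std::size_t K) {
    assert(a.size() == M * K && b.size() == K * N);
    c.assign(M * N, 0);
    std::size_t passes = 0;
    for (std::size_t m0 = 0; m0 < M; m0 += kBlock) {
        for (std::size_t n0 = 0; n0 < N; n0 += kBlock) {
            for (std::size_t k0 = 0; k0 < K; k0 += kDepth) {
                block_pass(a, b, c, M, N, K, m0, n0, k0);
                ++passes;
            }
        }
    }
    return passes;
}

// Naive triple loop, used only to check the tiled result.
Matrix reference_gemm(const Matrix& a, const Matrix& b,
                      std::size_t M, std::size_t N, std::size_t K) {
    Matrix c(M * N, 0);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}
```

In this model, growing M, N, or K only increases the value returned by `temporal_gemm`; `kCores` never changes. On real hardware the tiling and replication would live in Programmable Logic and the passes would be streamed, which this sketch does not attempt to model.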
Results (on Versal AI Edge):
607 GOPS at 10.7W total on-chip power.
22x core frugality vs. spatial SOTA (ARIES).
211x higher platform-aware utility (PAU).
Zero URAM/DSP utilization.
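For a rough sense of scale, two derived figures follow directly from the numbers above. This is a back-of-the-envelope calculation, assuming the 607 GOPS and 10.7 W values come from the same measurement. The paper may define its efficiency metrics differently.

```latex
\frac{607\ \text{GOPS}}{10.7\ \text{W}} \approx 56.7\ \text{GOPS/W}
\qquad
\frac{607\ \text{GOPS}}{16\ \text{cores}} \approx 37.9\ \text{GOPS/core}
```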
Repository: https://github.com/mgrailoo/TEMPUS
Paper: https://arxiv.org/abs/2605.00536
The repo includes end-to-end flows from PyTorch comparison to hardware deployment. We hope this provides a sustainable foundation for edge LLM inference on Versal.
Happy to answer any questions about the implementation, tiling schemes, or performance metrics!