
Tempus: A Resource-Invariant GEMM Framework for Versal AI Edge (607 GOPS on 16 cores + open-source C++)

Hugging Face Forums [Unofficial] May 7, 2026

TL;DR: We built a GEMM framework that achieves 607 GOPS on AMD Versal AI Edge using only 16 AIE-ML cores, without scaling hardware resources. The complete C++/HLS code is open-source.

The Problem: Most SOTA GEMM frameworks scale by adding more cores (spatial scaling). This fails on resource-limited edge SoCs due to routing congestion and bandwidth saturation.

Our Solution (Tempus):

Temporal scaling instead of spatial: fixed 16-core compute block.
Algorithmic data tiling & replication on Programmable Logic.
Deadlock-free DATAFLOW pipeline with cascade streaming at an initiation interval of 1 (II=1).
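
To make the temporal-scaling idea concrete, here is a minimal host-side sketch in plain C++. It is not taken from the Tempus repository: the tile edge `kTile` and the functions `compute_block`, `load_tile`, and `gemm_temporal` are hypothetical names chosen for illustration. The point it shows is that a compute block of fixed size can serve any matrix shape by being invoked more times, with tiling and zero padding handled outside the block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile edge. It stands in for whatever the fixed compute block
// handles per pass; the real Tempus tile sizes are not stated in this post.
constexpr std::size_t kTile = 4;

// Fixed-size compute block: c += a * b on kTile x kTile tiles.
// Its size never depends on the overall problem size.
void compute_block(const std::int32_t* a, const std::int32_t* b, std::int32_t* c) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t k = 0; k < kTile; ++k)
            for (std::size_t j = 0; j < kTile; ++j)
                c[i * kTile + j] += a[i * kTile + k] * b[k * kTile + j];
}

// Copy one tile out of a row-major (rows x cols) matrix, zero-padding the edges.
void load_tile(const std::vector<std::int32_t>& m, std::size_t rows, std::size_t cols,
               std::size_t r0, std::size_t c0, std::int32_t* tile) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j) {
            const std::size_t r = r0 + i, c = c0 + j;
            tile[i * kTile + j] = (r < rows && c < cols) ? m[r * cols + c] : 0;
        }
}

// C (M x N) = A (M x K) * B (K x N), all row-major.
// Larger problems take more passes through the same block (temporal scaling)
// instead of more blocks (spatial scaling).
std::vector<std::int32_t> gemm_temporal(const std::vector<std::int32_t>& A,
                                        const std::vector<std::int32_t>& B,
                                        std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<std::int32_t> C(M * N, 0);
    std::int32_t ta[kTile * kTile], tb[kTile * kTile], tc[kTile * kTile];
    for (std::size_t i0 = 0; i0 < M; i0 += kTile)
        for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
            std::fill(tc, tc + kTile * kTile, 0);  // fresh accumulator per output tile
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                load_tile(A, M, K, i0, k0, ta);
                load_tile(B, K, N, k0, j0, tb);
                compute_block(ta, tb, tc);
            }
            // Write back only the part of the tile that lies inside C.
            for (std::size_t i = 0; i < kTile && i0 + i < M; ++i)
                for (std::size_t j = 0; j < kTile && j0 + j < N; ++j)
                    C[(i0 + i) * N + (j0 + j)] = tc[i * kTile + j];
        }
    return C;
}
```

With these assumed names, multiplying a 2x3 matrix by a 3x2 matrix runs the block once with zero padding, while larger shapes only add passes through the same block. In the actual design, tiling and replication happen on the Programmable Logic rather than in host loops like these.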

Results (on Versal AI Edge):

607 GOPS at 10.7W total on-chip power.
22x core frugality vs. spatial SOTA (ARIES).
211x higher platform-aware utility (PAU).
Zero URAM/DSP utilization.
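
For readers who want per-watt and per-core figures, the headline numbers above can be turned into derived metrics with simple division. The sketch below uses only the 607 GOPS, 10.7 W, and 16-core values quoted in this post; the constant and function names are hypothetical. These are back-of-the-envelope values, roughly 56.7 GOPS/W against total on-chip power and about 37.9 GOPS per core, not figures reported by the paper.

```cpp
// Headline figures quoted in the post.
constexpr double kThroughputGops = 607.0;  // sustained GEMM throughput
constexpr double kOnChipPowerW   = 10.7;   // total on-chip power
constexpr int    kAieMlCores     = 16;     // fixed compute block size

// Energy efficiency relative to total on-chip power, in GOPS per watt.
constexpr double gops_per_watt() { return kThroughputGops / kOnChipPowerW; }

// Average throughput contributed by each AIE-ML core, in GOPS.
constexpr double gops_per_core() { return kThroughputGops / kAieMlCores; }
```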

Repository: https://github.com/mgrailoo/TEMPUS
Paper: https://arxiv.org/abs/2605.00536

The repo includes end-to-end flows from PyTorch comparison to hardware deployment. We hope this provides a sustainable foundation for edge LLM inference on Versal.

Happy to answer any questions about the implementation, tiling schemes, or performance metrics!
