
AIOS: CPU-Native LLM Inference Architecture — Seeking Validation Contributors

Hugging Face Forums [Unofficial] March 26, 2026

I’ve published a framework paper proposing a CPU-native inference architecture for large language models.

Core argument: LLMs are slow on CPUs not because CPUs are unsuited to inference, but because models and runtimes were designed around GPU memory architecture and never redesigned for the CPU cache hierarchy. AIOS proposes a memory residency controller and a Model Contract to close that gap.
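To make the memory-traffic framing concrete, here is a minimal sketch of the usual bandwidth-bound estimate: if generating one token requires streaming some number of bytes from DRAM, sustained memory bandwidth divided by that number caps decode throughput. The function name and all input numbers are hypothetical, chosen only for illustration; they are not measurements and are not taken from the paper.

```python
def max_tokens_per_second(dram_bytes_per_token: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode throughput when DRAM traffic is the bottleneck."""
    if dram_bytes_per_token <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / dram_bytes_per_token


GB = 1e9  # decimal gigabyte

# Hypothetical inputs (illustration only, not measured values):
# 40 GB/s sustained bandwidth, 4 GB moved from DRAM per generated token.
baseline = max_tokens_per_second(4.0 * GB, 40.0 * GB)

# Same bandwidth, but only half as much data moved per token.
halved = max_tokens_per_second(2.0 * GB, 40.0 * GB)

print(f"baseline bound: {baseline:.1f} tok/s, halved traffic: {halved:.1f} tok/s")
```

Under this simple model, halving the DRAM bytes moved per token doubles the throughput ceiling, which is why bytes per token is the quantity worth measuring first.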

What AIOS is:

A runtime (the memory residency controller) that sits between inference engines and hardware and reduces DRAM data movement per generated token
A Model Contract: five architectural requirements that a model can satisfy to expose the full optimization surface

Current state: Paper published, spec complete, validation tooling runnable. Runtime not yet implemented. All performance projections are analytical — no empirical results exist yet.

What I need most: Someone with bare-metal Linux (Intel Haswell+ or AMD Zen+, 16 GB RAM) to run the Phase 1 baseline measurement on Falcon 7B Q4_K_M using stock llama.cpp. The full protocol is in Issue #2. It takes about 2 hours including setup.
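The actual measurement protocol lives in Issue #2 and is not reproduced here. As a rough sketch of the arithmetic behind an MB/token figure, assume the run yields a total count of cache-line transfers from DRAM (for example, from memory controller counters) and the number of tokens generated. The function names, the 64-byte line size, and the decimal definition of MB are assumptions for illustration, and the input numbers are made up.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte lines, as on current x86 CPUs


def bytes_from_line_transfers(line_transfers: int,
                              line_size: int = CACHE_LINE_BYTES) -> int:
    """Convert a count of cache-line transfers into bytes moved."""
    if line_transfers < 0 or line_size <= 0:
        raise ValueError("invalid transfer count or line size")
    return line_transfers * line_size


def mb_per_token(dram_bytes: int, tokens_generated: int) -> float:
    """Average DRAM traffic per generated token, in decimal megabytes."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return dram_bytes / 1e6 / tokens_generated


# Made-up inputs, not results: 8e9 line transfers over a 128-token run.
total_bytes = bytes_from_line_transfers(8_000_000_000)
per_token = mb_per_token(total_bytes, 128)
print(f"{per_token:.1f} MB/token")
```

If the protocol reports bytes directly, or defines MB as 2^20 bytes, only the conversion constants change.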

Links:

Hugging Face: aios-framework/aios-paper
Issue #2 (start here): Falcon 7B + AIOS: measure baseline MB/token (primary validation) · Issue #2 · acasavaraju/AIOS · GitHub
