Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiensjep6j5vem2himypws5t72groqxygabacxro327zsvn7dfwpb4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miar6oyjzaf2"
  },
  "path": "/t/aios-first-ground-truth-baseline-cpu-dram-measurement/174769#post_1",
  "publishedAt": "2026-03-29T23:13:10.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - acasavaraju/AIOS: CPU-native LLM inference architecture. Memory residency controller that reduces DRAM data movement per generated token through weight aliasing, sparsity maps, KV cache tiering, and activation chunking. Includes Model Contract spec for architecture co-design. Framework + validation tooling — runtime contributions welcome. Paper: SSRN 6467298 · GitHub",
    "@reimorster"
  ],
  "textContent": "AIOS — First Ground Truth Baseline (CPU DRAM Measurement)\n\nFollowing up on my earlier post introducing AIOS (CPU-native LLM inference architecture), we now have the first validated baseline measurement using hardware memory controller counters.\n\nSetup\n\n  * Model: Falcon 7B (GGUF Q4_K_M)\n\n  * CPU: Intel Core Ultra 7 265K (20 cores)\n\n  * OS: Arch Linux (kernel 6.19.10-zen1-1-zen)\n\n  * Method: perf uncore IMC counters (uncore_imc_free_running_0/data_read/)\n\n\n\n\nResults (5 runs × 200 tokens)\n\n  * MB/token: 2340 ± 4 MB\n\n  * Coefficient of Variation: 0.17%\n\n  * Tokens/sec: 11.43 ± 0.05\n\n\n\n\nKey Takeaways\n\n  * The measurement is highly stable (CV < 1%), confirming that DRAM reads can be treated as a reliable physical metric.\n\n  * ~456–459 GB DRAM read for 200 tokens highlights the memory bandwidth wall in CPU inference.\n\n  * This establishes a ground truth baseline for AIOS evaluation.\n\n\n\n\nWhy this matters\n\nMost inference discussions optimize for tokens/sec.\n\nAIOS instead treats MB/token as the primary constraint, because on CPUs, memory movement—not compute—is the bottleneck.\n\nWhat’s next\n\n  * Issue #1: Falcon 7B “relufication” (R1 compliance)\n\n  * Headroom analysis (validation/headroom.py)\n\n  * Additional baselines across models / quantizations\n\n\n\n\nCall for contributors\n\nIf you can run perf on bare-metal Linux, contributions are very valuable:\n\n  * Run baseline measurements on your hardware\n\n  * Validate different models / quantizations\n\n  * Help quantify headroom vs AIOS projections\n\n\n\n\nRepo: GitHub - acasavaraju/AIOS: CPU-native LLM inference architecture. Memory residency controller that reduces DRAM data movement per generated token through weight aliasing, sparsity maps, KV cache tiering, and activation chunking. Includes Model Contract spec for architecture co-design. Framework + validation tooling — runtime contributions welcome. Paper: SSRN 6467298 · GitHub\n\nAcknowledgment\n\nHuge thanks to @reimorster for running the first full validation and helping establish this baseline.\n\nThis is the first step toward making memory movement a first-class metric for LLM inference.",
  "title": "AIOS — First Ground Truth Baseline (CPU DRAM Measurement)"
}