{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiad775fa33mrugqzmsxexzcmpcjgk54paovnm3ljbwmkyv37ejej4",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mohlvf5okgd2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreifanldaryzsyyt7u3ygm75d5a3wrfosxscbrhjujurpa6tcqzrko4"
},
"mimeType": "image/webp",
"size": 59018
},
"path": "/creeta/kog-hits-3k-ts-on-mi300x-no-kernel-switches-test-it-now-55dh",
"publishedAt": "2026-06-17T04:59:23.000Z",
"site": "https://dev.to",
"tags": [
"amd",
"mi300x",
"kogai",
"monokernel",
"playground.kog.ai",
"Kog engineering writeup",
"Kog AI blog",
"AITER",
"blog.kog.ai"
],
"textContent": "AMD's MI300X has long had more single-request inference headroom than the default ROCm stack exposes. A Paris startup just showed how much by deleting the per-token kernel launch entirely.\n\n## How the monokernel eliminates kernel-launch overhead\n\nA monokernel is a single, persistent GPU-resident program that runs the entire LLM generation loop (prefill, decode, LM-head sampling, and the EOS stop check) without returning to the host CPU or launching a new kernel per token. Kog AI reports **3,000+ output tokens/s per request** for an FP16 2B model at batch size 1 on one 8× MI300X node. The monokernel is the engine behind the Kog Inference Engine tech preview launched 28 May 2026. The figure matters because batch-1 decoding is bound by HBM bandwidth, not compute, so the dead time between kernels dominates.\n\n**Quick Answer:** Standard MI300X stacks launch a fresh GPU kernel for every stage of every token, each paying ~4.5 μs launch overhead plus HBM restart latency. Kog's monokernel collapses the whole decode loop into one persistent kernel with zero CPU interaction, reaching 3,000+ tokens/s per request on an 8× MI300X node (FP16 2B model, batch 1).\n\nConventional stacks (vLLM, SGLang, ROCm/HIP pipelines) launch a fresh kernel for every stage of every token. Kog quantifies the recurring tax that the monokernel removes:\n\nOverhead source | Cost per occurrence\n---|---\nKernel launch (per stage) | ~4.5 μs\nHBM latency on each memory-load restart | ~0.5 μs\nIntermediate tensor materialization round-trip to HBM | >1 μs\n\nSynchronization is rebuilt to match. Instead of atomic arrival counters, buffers initialize to NaN and consumers poll until real data appears. This sentinel-value polling cuts sync latency from ~7.8 μs to ~0.9 μs, though synchronization still eats roughly 35% of token-generation time. Is the peak number solid? 
A topology-tuned variant that groups compute units by HBM die adjacency is cited at **3,300 tokens/s**, but that figure comes from a secondary report rather than the primary blog (which states 3,000+). Treat the exact peak cautiously: this is single-vendor, self-reported data with no independent benchmark yet.\n\n## KIE playground or raw HIP: which to choose\n\nThere are exactly two ways to engage with Kog's work today, and they sit at opposite ends of the effort spectrum. The hosted Kog Inference Engine (KIE) playground is a zero-setup, browser-accessible demo; the raw HIP replication is a research-level undertaking. For nearly every developer, the playground is the only immediately actionable option; the HIP path is not a weekend project.\n\nThe playground at playground.kog.ai runs the Laneformer 2B coding model, which scores roughly 50% on HumanEval, on Kog's own 8× MI300X cluster. You interact with the model in the browser and watch the per-request token rate firsthand, with no hardware to provision. It is the fastest way to verify the latency claim with your own prompts.\n\nThe HIP replication path is a different category of work. To reproduce the monokernel you need an AMD Instinct GPU, a ROCm 6.x stack, and deep HIP/assembly experience. The implementation required hand-written inline assembly for atomics on 3-dword types, manual register-pressure management (LICM, instruction inspection), and a custom cross-GPU timestamp profiling harness synced via the HSA API.\n\nCrucially, as of June 2026 there is no open-source kernel and no pip package. The Kog engineering blog is the only public implementation reference: a detailed writeup, not a clonable repo. If you want the technique, you reimplement it from the prose.\n\n## Hands-on: from KIE playground to HIP replication\n\nStart at the playground, then escalate to HIP only if you need the technique itself. 
The zero-setup path is playground.kog.ai: open the page, submit a coding prompt, and watch the per-request token counter in the response UI. The model behind it is Laneformer 2B running FP16 on Kog's 8× MI300X node, and the tech preview launched 28 May 2026 requires no login.\n\nTo replicate the technique, start from the Kog engineering writeup. It documents compile-time work partitioning, a 256-compute-unit grid with `gridDim=(256,)` and `blockDim=(64,8)`, and tensor duplication per I/O die to avoid cross-die reduction penalties on the chiplet design. Two implementation details matter most before you attempt the full loop:\n\n * **GEMV, not GEMM.** At batch size 1 each projection is a matrix-vector multiply (GEMV), so the monokernel uses scalar/vector ALU `dot2` instructions rather than matrix cores; matrix cores only earn their keep once the batch size fills their tile. Replicate this for your weight shapes first.\n * **Delayed Tensor Parallelism (DTP).** TP reductions from attention and FFN are deferred and folded into later layers, so cross-GPU traffic over Infinity Fabric runs asynchronously, hidden behind arithmetic. This is what makes the 8-GPU lane split viable without a synchronous communication wall.\n\n> \"The monokernel collapses the entire decode loop — including sampling and the EOS stop check — into one persistent kernel, so the host CPU never re-enters the path,\" per Kog's engineering team (source: Kog AI blog).\n\nIf hand-written HIP and inline assembly are more than you want to own, start one level up with AMD's AITER (AI Tensor Engine for ROCm), the sanctioned reference, with Triton, Composable Kernel, HIP, and hand-tuned assembly backends already wired into vLLM and SGLang. 
A minimal \"does Kog respond\" check might look like the illustrative snippet below. The `kog` command and its flags are hypothetical placeholders (no public CLI or package exists as of June 2026), so the script was not executed here and exits early when the command is absent:\n\n    import shutil\n    import subprocess\n\n    # Hypothetical CLI: the `kog` command and its flags are placeholders, not a published tool.\n    if shutil.which(\"kog\") is None:\n        raise SystemExit(\"needs dependency: kog runtime/CLI\")\n\n    cmd = [\"kog\", \"bench\", \"--device\", \"mi300x\", \"--target-tps\", \"3000\", \"--no-kernel-switches\"]\n    print(\"+\", \" \".join(cmd))  # echo the command before running it\n    out = subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT)\n    print(out)\n\n\n## What the 3K t/s figures don't cover\n\nThe headline numbers describe one narrow configuration: a custom 2B-parameter \"Laneformer\" model running at FP16 and batch size 1 on a single 8× MI300X node. As of June 2026, there is no published evidence that the monokernel generalizes to larger dense or MoE architectures, to FP8 or other quantized precisions, to batch sizes above 1, or to multi-node setups; the AI Weekly summary flags exactly these as unproven.\n\nThe results are also entirely self-reported. No independent third-party benchmark has appeared, and the widely circulated 3,300 t/s figure originates from AI Weekly's 29 May 2026 write-up of a topology-tuned variant, not Kog's primary blog, which states 3,000+. Treat the exact peak cautiously until someone outside Kog reproduces it.\n\nThe cross-vendor comparison carries the same caveat: Kog reports a sibling monokernel reaching ~2,100 t/s on 8× NVIDIA H200 under identical FP16, batch-1 conditions, also self-reported, with no external validation.\n\nFinally, several capabilities developers will want are roadmap items, not shipping features. 
Kog lists third-party MoE model support, quantization such as FP8, speculative decoding, and larger batch sizes as planned but not yet delivered.\n\n## Going deeper: chiplet anatomy and what comes next\n\nTo understand why topology tuning matters, look at the die map. The MI300X is a CDNA3 chiplet design: 8 Accelerator Complex Dies (XCDs) holding 304 compute units total (38 per XCD) sitting atop 4 I/O dies (IODs), with 192 GB of HBM3 at roughly 5.3 TB/s peak bandwidth. Kog's monokernel deliberately uses 256 of the 304 CUs and duplicates tensors per IOD, trading a little memory to avoid the cross-die all-reduce penalties that would otherwise stall a single-request decode.\n\nIf you want to start on MI300X attention kernels without Kog-level resources, AMD's AITER MLA decode tutorial on the ROCm AI Developer Hub is the lowest-friction on-ramp. It targets Ubuntu 22.04 and ROCm 6.3.1, runs in a Docker container with `/dev/kfd` and `/dev/dri` exposed, and walks through cloning AITER recursively, running `python3 setup.py develop`, and calling `mla_decode_fwd` directly.\n\nAs for Kog itself, the KIE tech preview post lists third-party MoE models, additional batch sizes, quantization, and speculative decoding as planned, with no dates attached. The takeaway: the 3K t/s number is a single-request, single-model proof point, not a general benchmark. Try the playground today, watch blog.kog.ai for the roadmap, and reach for AITER when you need a reproducible kernel path now.\n\n## Frequently asked questions\n\n### Do I need an AMD MI300X to try the Kog Inference Engine?\n\nNo. The KIE tech preview is a hosted browser playground at playground.kog.ai, running the Laneformer 2B coding model on Kog's own 8× MI300X cluster. You interact through the browser and watch the per-request token rate directly, with no local GPU, drivers, or setup required. 
You only need your own MI300X if you want to replicate the monokernel in HIP from the engineering writeup yourself.\n\n### Why does the monokernel skip tensor cores and use scalar/vector ALU instead?\n\nAt batch size 1, decode is a GEMV (matrix-vector multiply), not a GEMM, so matrix cores stay idle. Tensor/matrix-core primitives only pay off when the batch is large enough to fill their tile; a single-vector multiply cannot. Kog therefore implements the projection with scalar/vector ALU `dot2` instructions, which are faster for batch-1 decode where HBM bandwidth, not compute, is the bottleneck.\n\n### What is Delayed Tensor Parallelism and why does it matter here?\n\nDelayed Tensor Parallelism (DTP) defers the tensor-parallel all-reduce from attention and FFN and folds it into the computation of later layers, so cross-GPU traffic over Infinity Fabric runs asynchronously, hidden behind arithmetic. This avoids the synchronous communication stall that normally penalizes 8-GPU tensor parallelism at batch 1, where the model is split into 8 lanes across 8 GPUs and a blocking reduction per layer would otherwise dominate latency.\n\n### How does AMD's AITER differ from what Kog built?\n\nAITER (AI Tensor Engine for ROCm) is a framework-level operator library with Triton, Composable Kernel, HIP, and hand-tuned assembly backends, already wired into vLLM and SGLang production-serving paths. Kog's monokernel is the opposite: a hand-crafted, compile-time work-partitioned single kernel with no framework abstraction, written in HIP with inline assembly. It is lower-level, not open-sourced, and demonstrated only on a custom 2B model; AITER is the reproducible path when you need a kernel today.\n\n### Is the 3,300 tokens/s figure from the Kog blog?\n\nNo. The Kog engineering blog states 3,000+ output tokens per second per request for an FP16 2B model at batch size 1 on a single 8× MI300X node. 
The 3,300 figure appeared in an AI Weekly summary on 29 May 2026 describing a topology-tuned variant. With no independent replication as of June 2026, treat 3,000+ as the primary number.",
"title": "Kog hits 3K t/s on MI300X, no kernel switches — test it now"
}