Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiev2a3xd5a6doyqbtjl2whhaps2a5r2irwwqt5hvv7ulbzrnmztg4",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpbfijrolra2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreif6o7da6acnel3qb6rrppcpwtoenhaxvybtt4dt2z5lipfq3glwwy"
    },
    "mimeType": "image/webp",
    "size": 112486
  },
  "path": "/pueding/openai-and-broadcoms-jalapeno-a-custom-inference-asic-inference-asic-vs-gpu-36jm",
  "publishedAt": "2026-06-27T11:21:30.000Z",
  "site": "https://dev.to",
  "tags": [
    "ai",
    "llm",
    "hardware",
    "machinelearning"
  ],
  "textContent": "**What:** The **OpenAI and Broadcom Jalapeño announcement** (June 24, 2026) introduces OpenAI's **first custom LLM-inference ASIC** — a reticle-sized compute chiplet paired with HBM, built to **run** models rather than train them. The idea it makes concrete is an **inference-optimized ASIC versus a general-purpose GPU**.\n\n**Why:** At decode time the bottleneck is usually **moving data, not doing math**, so a chip co-designed around that movement can serve the same tokens using **far less power per token** — early testing reports substantially better performance-per-watt (final numbers still being measured), which at OpenAI's scale materially changes serving cost.\n\n**vs prior:** A **general-purpose GPU** runs anything — training, graphics, every model — and pays in silicon and power for that flexibility; Jalapeño is **hard-wired for inference only**, trading the GPU's versatility for a shorter, faster path between memory and compute.\n\n##  Think of it as\n\nA kitchen rebuilt to cook one dish, with the pantry moved beside the stove.\n\n\n\n                      THE ONE DISH: LLM inference\n                                │\n                ┌───────────────┴───────────────┐\n                │                               │\n         ┌──────▼───────┐                ┌──────▼───────┐\n         │ Inference    │                │ General      │\n         │ ASIC         │                │ GPU          │\n         │ (one dish)   │                │ (whole menu) │\n         └──────┬───────┘                └──────┬───────┘\n                │                               │\n       pantry beside the stove        pantry down the hall\n       (HBM next to compute)          (data travels far)\n                │                               │\n                ▼                               ▼\n       ✓ most plates per gas          ✗ pays power for\n         (perf-per-watt)                flexibility unused\n\n\n  * inference ASIC = a kitchen rebuilt to cook one 
dish, as quickly and cheaply as possible\n  * general-purpose GPU = a restaurant kitchen that can cook anything on the menu\n  * data-movement bottleneck = cooks spending the night carrying ingredients from a far pantry\n  * HBM beside the compute chiplet = moving the pantry right next to the stove\n  * performance-per-watt = more plates served for every unit of gas burned\n\n\n\n##  Quick glossary\n\n**ASIC** — An **Application-Specific Integrated Circuit** — silicon built for **one kind of job** rather than general-purpose computing. Giving up a general processor's flexibility buys speed and energy efficiency on that job. Jalapeño's job is LLM inference.\n\n**HBM** — High-Bandwidth Memory — stacked DRAM placed **physically very close to the compute die** so data reaches the math units faster. It is the same fast memory used on high-end GPUs, and it is where the model actually lives during serving.\n\n**Inference vs training** — Training **builds** a model's weights; inference **runs** the finished weights to generate tokens. They stress hardware differently, so a chip can be excellent at one and unable to do the other. Jalapeño is **inference-only**.\n\n**Memory-bandwidth-bound** — Describes a computation that spends most of its time **waiting for data to arrive from memory** rather than doing arithmetic. Single-token decode is the classic example: lots of bytes read, little math per byte.\n\n**Tape-out** — The moment a chip design is finished and **sent to the fab to be manufactured**. Jalapeño went from first design to tape-out in **roughly nine months**, which OpenAI describes as one of the fastest such cycles to date.\n\n**Reticle-sized chiplet** — The _reticle_ sets the largest area a chip-making machine can pattern in a single exposure (roughly 850 mm² on current tools). 
A **reticle-sized compute chiplet** is about as large as one die can physically get — Jalapeño pairs one such tile with HBM.\n\n**Performance-per-watt** — Useful work (tokens generated) divided by the **electrical power it costs**. At data-center scale this — not peak speed alone — sets the bill, which is why a custom inference chip targets it directly.\n\n> **The news.** On June 24, 2026, **OpenAI and Broadcom** unveiled **Jalapeño**, OpenAI's first \"Intelligence Processor\" — a purpose-built **ASIC for LLM inference**, not a repurposed training accelerator or a general-purpose AI chip. It pairs a single **reticle-sized compute chiplet** with **HBM** (not commodity DRAM) to deliver high throughput and low latency together, and went from first design to **tape-out in roughly nine months**. Engineering samples are already running production workloads in the lab, including **GPT-5.3-Codex-Spark**, with early testing reporting performance-per-watt \"substantially better\" than the current state of the art (final numbers still being measured). Initial deployment is targeted for **end of 2026**.\n\nPicture a restaurant kitchen that can cook anything on the menu — pastry, grill, soup, all of it. That flexibility is wonderful, and it is exactly what a **general-purpose GPU** gives you: thousands of programmable cores that will run any parallel workload you throw at them, from training a model to rendering a game. **Jalapeño is that kitchen torn down and rebuilt to cook one dish — LLM inference — and nothing else.** The bet is that if you only ever cook one dish, a kitchen shaped around that single dish will cook it faster and far more cheaply than the do-everything kitchen ever could.\n\nSo what is the \"one dish\" actually limited by? 
Here is the part that surprises people: **at decode time, the thing slowing the kitchen down is not the chef's hands — it is the cooks walking ingredients in from a far pantry.** When a model generates a token, at small batch sizes it must stream the model's weights out of memory and through the compute units once, while doing comparatively little arithmetic per byte read. That makes single-token decode **memory-bandwidth-bound** — on a roofline plot it sits on the memory-bound side, and the math units sit mostly idle, waiting on data. The bottleneck the whole chip is fighting is _data movement_.\n\n\n\n    Single-token decode — where the time goes:\n\n    moving data  ████████████████████████████████  dominates\n    computing    █                                  a sliver\n\n\nThe diagram makes the imbalance concrete: in the bandwidth-bound regime, the \"moving data\" bar dominates and the \"computing\" bar is a sliver. **Jalapeño's answer is the obvious one once you see the problem — move the pantry next to the stove.** It pairs that big compute chiplet with **HBM kept physically close**, so the costly trip between memory and compute is as short and as fast as the silicon allows. OpenAI says the design was derived from its _own_ measurements of how its models behave at serving time, which is what \"co-designed\" really means here: the chip is shaped around the bottleneck the company actually observed, not a generic one.\n\nWalk the decode math on a single token _(illustrative numbers — OpenAI has not published Jalapeño's figures)_. Say a model holds **100 GB of weights** and the accelerator reads them from memory at **4 TB/s**. Generating one token streams those weights through compute roughly once, so the time is about **100 GB ÷ 4 TB/s = 25 ms** — and across that 25 ms the arithmetic units are mostly idle, waiting. Now **double the effective memory bandwidth and that 25 ms roughly halves**; double the raw compute instead and almost nothing changes. 
**That is the whole reason an inference chip is built around feeding the math units, not stacking more of them** — and why the headline metric is _performance-per-watt_, not peak FLOPs.\n\nNone of this means GPUs are going away. The trade Jalapeño makes is real and one-directional: **you give up the GPU's ability to train, to switch to a very different kind of workload, and to run the whole range of models and tasks a GPU handles.** A custom ASIC only pays off when you run one workload at enormous, sustained scale — which is precisely OpenAI's situation, and precisely why a startup serving a thousand requests a day would still reach for a GPU. The interesting signal is not \"ASICs beat GPUs\"; it is that LLM inference has become a large and stable enough workload to justify building a dedicated chip for it.\n\nChip | Built for | Flexibility | Where it wins\n---|---|---|---\nGeneral-purpose GPU | training + inference + any parallel workload | Highest | The default — runs anything, backed by a mature software ecosystem\nRepurposed training accelerator | training, also used to serve | High | Strong throughput, but carries training-only hardware that idles during inference\n**Inference ASIC (Jalapeño)** | **LLM inference only** | Lowest | **Built for top performance-per-watt on its one workload at scale (early results); inference-only, far less flexible**\n\n###  Related explainers\n\n  * NVIDIA AI factories — tokens per megawatt — frames serving as a performance-per-watt problem at the datacenter scale Jalapeño is built to win.\n  * AMD Atom — prefill/decode disaggregation — another hardware answer to the fact that prefill and decode stress the chip in opposite ways.\n  * Blackwell on MLPerf 6.0 — strong scaling — the general-purpose GPU side of the same inference-efficiency race.\n  * Jetson Thor — edge Blackwell — purpose-built inference silicon at the opposite end of the scale, the edge.\n\n\n\n##  
FAQ\n\n###  What is an inference ASIC like Jalapeño?\n\nAn inference ASIC is an Application-Specific Integrated Circuit — silicon built for one kind of job rather than general-purpose computing — made to run (not train) large language models. OpenAI and Broadcom's Jalapeño, unveiled June 24, 2026, is OpenAI's first such chip: a reticle-sized compute chiplet paired with HBM, co-designed around the data-movement bottleneck of serving models at scale. It gives up a GPU's general-purpose flexibility in exchange for higher performance-per-watt on that single workload (early testing reports substantially better, with final numbers still being measured).\n\n###  Why build a custom inference chip instead of using GPUs?\n\nAt decode time, generating a token is usually memory-bandwidth-bound — the chip spends most of its time moving the model's weights out of memory, not doing arithmetic. A general-purpose GPU pays in silicon and power for flexibility that inference never uses. A chip co-designed around the data-movement bottleneck — a large compute chiplet with HBM kept close — can serve the same tokens at substantially better performance-per-watt in early testing (final numbers still being measured), which at OpenAI's scale materially changes serving cost.\n\n###  How is Jalapeño different from a GPU?\n\nA GPU is general-purpose: thousands of programmable cores that run training, graphics, and any model. Jalapeño is an ASIC built for LLM inference only — it cannot train and is far less flexible than a general-purpose GPU. That is the trade: it loses the GPU's versatility and gains a shorter, faster path between memory and compute, which is what matters when the bottleneck is data movement rather than raw math. A custom ASIC pays off only when you run one workload at enormous, sustained scale.\n\nOriginally posted on Learn AI Visually.",
  "title": "OpenAI and Broadcom's Jalapeño, a Custom Inference ASIC: Inference ASIC vs GPU"
}