Nvidia targets inference as AI’s next battleground with Groq 3 LPX

Network World [Unofficial] March 18, 2026

2026 is predicted to be the year that AI moves from pilot to production, becoming measurably useful across the enterprise. But while many businesses are ready, the underlying infrastructure doesn’t seem to be, particularly when it comes to next-stage inferencing.

Nvidia says it has overcome these limitations, achieving what it calls a “milestone” in accelerated computing.

The chip company today unveiled the Nvidia Groq 3 LPX inference accelerator for Vera Rubin GPUs. The combined architecture is optimized for “trillion-parameter models and million-token context,” and Nvidia claims it can deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity.

Groq 3 LPX was announced today at Nvidia GTC as part of an architecture comprising seven new chips and five racks meant to work together as “one big supercomputer.”

The release represents a paradigm shift, Nvidia said, with the architecture built for running inference workloads in production rather than merely for training large language models (LLMs).

“Whereas training is a ‘forget budget, forget power, let’s get this model trained ASAP’ kind of thinking, inference is persistent/sustained performance of AI-powered workflows and applications,” noted Matt Kimball, VP and principal analyst at Moor Insights & Strategy.

It’s a big cost play, he pointed out, and it “has to happen everywhere, all the time, for all users.”

The next phase of inferencing

The new Groq 3 language processing units (LPUs) are based on intellectual property (IP) from Groq, which signed a $20 billion licensing agreement with Nvidia late last year. According to Nvidia, a fleet of LPUs can function as a “giant single processor.”

While Rubin GPUs will continue to handle prefill (prompt processing), Groq’s LPX will now handle latency-sensitive portions of decode (response). Together, they can deliver a “new class of inference performance,” Nvidia says.

Each LPX rack features 256 LPUs with 128 GB of on-chip static random-access memory (SRAM), 150 terabytes per second (TB/s) of bandwidth, chip-to-chip links, and high-speed connections to NVL72, Nvidia’s liquid-cooled AI supercomputer. Combined, these can reduce latency to “near zero,” Nvidia claims.

The LPX integration with Vera Rubin AI factories will be available in the second half of this year.

Training versus inferencing

Training and inference stress infrastructure in very different ways, noted Sanchit Vir Gogia, chief analyst at Greyhound Research. While training rewards “massive parallelism and brute-force scale,” inferencing (especially for long context and interactive reasoning) is far more sensitive to latency, memory movement, cache behavior, concurrency, and cost per delivered token.

GPUs are “phenomenal” for training, but the industry has reached a point where one dominant GPU story is no longer enough, said Gogia. Training is finite, while inference is continuous: Every prompt, tool call, reasoning step, retrieval cycle, and agent loop consumes resources in production.

LPX is addressing the “ugliest part” of the AI infrastructure stack, he said, and the challenge is not just raw compute. Current AI deployments “begin to wobble” when they have to combine long context, sequential token generation, memory pressure, and low-latency expectations, all while keeping expensive infrastructure usable amidst unpredictable, interactive demand.

“Nvidia is now openly redesigning accelerated computing around inference as a distinct systems problem, rather than pretending the same architecture can elegantly handle everything from training to long-context, interactive, agentic inference,” said Gogia. “That is the real shift.”

Coupling prefill and decode functions

LPX matters because it addresses the prefill-decode split. Prefill and decode are two fundamental, yet distinct, stages of LLM inferencing.

Kimball explained that prefill is the prompt: A question is entered and interpreted, and “a whole bunch of data from a bunch of sources” is collected to create context and determine the correct answer. On the other side of the coin, decode (also known as autoregressive generation) is the stage in which the response is produced token by token, as the user sees it.

“Inference is really two workloads sitting under one header: prefill and decode,” said Kimball. “Prefill is highly parallelized, decode is highly serialized.”

GPUs are optimal for prefill because they excel at highly parallelized functions; accelerators like Groq are better for decode because they are good at highly serialized tasks like token generation.
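The difference between the two stages can be sketched in a few lines of Python. This is a toy illustration only, not Nvidia's or Groq's implementation: the "model" is a stand-in arithmetic rule, and every name here (`prefill`, `decode_step`, `generate`, `KVCache`) is hypothetical. What it shows is the dependency structure: prefill entries can be computed independently of one another, while each decode step needs the result of the step before it.

```python
from typing import List, Tuple

# Toy stand-in for a model; NOT a real LLM. All names are hypothetical.
KVCache = List[Tuple[int, int]]  # one (key, value) pair per processed token


def prefill(prompt: List[int]) -> KVCache:
    """Process every prompt token in one pass.

    No entry depends on any other entry, so real hardware can compute
    them all in parallel.
    """
    return [(tok, tok * 2) for tok in prompt]


def decode_step(cache: KVCache) -> int:
    """Produce one new token from the whole cache, then extend the cache."""
    next_tok = sum(value for _, value in cache) % 50  # toy "model" rule
    cache.append((next_tok, next_tok * 2))
    return next_tok


def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    cache = prefill(prompt)          # parallel-friendly stage
    output: List[int] = []
    for _ in range(max_new_tokens):  # serial stage: step i needs step i-1
        output.append(decode_step(cache))
    return output


if __name__ == "__main__":
    print(generate([1, 2, 3], max_new_tokens=3))
```

Because the loop in `generate` cannot start step i until step i-1 has finished, adding more parallel compute does little for decode; per-step latency is what matters, which is the property the article attributes to LPU-style accelerators.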

“The faster that decode is, the better my agentic workflows behave,” said Kimball.

He pointed out that AWS and Cerebras also recently announced a partnership to support this type of disaggregated inference environment via Bedrock, and called the Nvidia announcement a shift in not only AI economics, but inference economics.

“We have these trained models, and inference is where AI is actually realized in the enterprise,” Kimball noted. “What good are these models if they are not making processes more accurate, faster, and more efficient?”

The takeaway for IT leaders

Still, it’s important to understand that LPX is not a “generic enterprise technology story,” Gogia noted.

“It is a specialized infrastructure response to the demands of premium, latency-sensitive, memory-intensive inference workloads,” he said, emphasizing, “IT leaders should not get hypnotized by Nvidia’s performance framing.”

The first question every IT leader should be asking is “brutally simple”: Do they actually need this class of infrastructure for their workloads?

Because, in reality, most enterprises don’t need trillion-parameter inference and million-token context as a default operating model. Many still struggle to govern smaller-scale generative AI deployments, let alone industrial-scale agentic systems.

The bigger unlock for enterprises in the next phase of AI will come from better model routing, caching, software optimization, memory management, workflow redesign, and inference telemetry, not from “jumping straight to the most advanced rack-scale architecture,” he noted.

Another important consideration is internal workload economics: What is the cost per useful token for an application? What happens when context expands, users increase, or agents begin chaining more reasoning steps? How much of the infrastructure is truly being utilized?

“These are the real questions, because AI infrastructure is increasingly about ‘goodput’ [good output], not just throughput,” said Gogia.
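One way to make those questions concrete is a back-of-the-envelope calculation. The sketch below is an assumption, not a formula from Nvidia or Gogia: it treats "useful" tokens as the share of generated tokens that end up in an accepted response, and it discounts idle capacity through a utilization factor. The function name, parameters, and sample figures are all hypothetical.

```python
def cost_per_useful_token(
    hourly_cost: float,        # assumed all-in infrastructure cost per hour
    tokens_per_second: float,  # peak generation rate of the deployment
    utilization: float,        # share of the hour spent serving (0 to 1]
    useful_fraction: float,    # share of tokens that were useful (0 to 1]
) -> float:
    """Return the cost of one useful token under the stated assumptions."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not (0 < utilization <= 1 and 0 < useful_fraction <= 1):
        raise ValueError("utilization and useful_fraction must be in (0, 1]")
    useful_tokens_per_hour = (
        tokens_per_second * 3600 * utilization * useful_fraction
    )
    return hourly_cost / useful_tokens_per_hour


if __name__ == "__main__":
    # Hypothetical figures, chosen only to exercise the function.
    busy = cost_per_useful_token(100.0, 10_000, 0.5, 0.8)
    idle = cost_per_useful_token(100.0, 10_000, 0.25, 0.8)
    print(f"50% utilized: ${busy:.8f} per useful token")
    print(f"25% utilized: ${idle:.8f} per useful token")
```

Under this model, halving utilization doubles the cost per useful token even though raw throughput is unchanged, which is why utilization and wasted tokens belong in the calculation alongside peak throughput.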

Additionally, he noted, IT leaders should consider memory a “strategic constraint.” Long context and KV-cache growth do not disappear, and, while Nvidia’s “clever” answer to this is tiering, externalizing context memory, and orchestrating across racks, it makes architecture decisions more complex.
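A rough sizing exercise shows why KV-cache growth becomes a constraint at long context. The sketch below uses the standard estimate for a transformer that caches one key and one value vector per token, per layer, per key-value head. The model dimensions in the example are hypothetical and do not describe any Nvidia, Groq, or other specific product.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes assumes 16-bit values
    batch_size: int = 1,
) -> int:
    """Estimate KV-cache size: keys plus values for every cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


if __name__ == "__main__":
    # Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128.
    for tokens in (8_000, 128_000, 1_000_000):
        size_gb = kv_cache_bytes(80, 8, 128, tokens) / 1e9
        print(f"{tokens:>9,} tokens -> {size_gb:8.2f} GB per sequence")
```

The estimate grows linearly with context length and with the number of concurrent sequences. For this hypothetical shape, a single million-token sequence needs roughly 328 GB of cache, which is the kind of pressure that tiering and externalized context memory are meant to relieve.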

Further, power and cooling must be treated as first-order variables, and leaders must pay close attention to ecosystem maturity and lock-in, Gogia noted. Nvidia is trying to own not only the silicon layer, but the system design, orchestration, and storage tiering, while dominating the economic narrative around premium tokens.

That makes software portability and ecosystem flexibility essential. “The winners in this next phase will not be the organizations that simply buy more AI infrastructure,” said Gogia. “They will be the ones that know exactly where premium inference matters, where it does not, and how to govern the difference.”
