External Publication

OpenAI-led consortium seeks to address AI processing bottlenecks

Network World [Unofficial] May 8, 2026

An OpenAI-led consortium of tech giants including AMD, Broadcom, Intel, Microsoft, and Nvidia has unveiled a new networking protocol designed to address network congestion, a long-standing problem that has been exacerbated by the massive amounts of data required for AI processing. The new protocol, called Multipath Reliable Connection (MRC), is designed for training models on 100,000+ GPUs. It distributes traffic across hundreds of network paths simultaneously rather than forcing it down a few lanes that can easily become congested.

“Network congestion, link, and device failures are the most common sources of delay and jitter in transfers,” OpenAI wrote in a blog post announcing the project. “These problems get more frequent, and harder to solve, as the size of the cluster increases.”

It went on to note that a single failure could often cause a training job to crash, forcing a restart from a saved checkpoint, or stall progress for many seconds while the network recomputed routes. Such interruptions are costly in both GPU cycles and time.

“The larger the job we run, the greater the impact of any single link flap or failure. These workloads act as a form of ‘failure amplifier,’ so preventing this has become critical,” the company said.

OpenAI led the development of the protocol and worked with AMD, Broadcom, Intel, Microsoft, and Nvidia, all of which made significant technical contributions. The project is hosted and coordinated by the Open Compute Project (OCP).

Nvidia is making its presence felt with the use of its Spectrum-X Ethernet as a part of MRC. The company says it is running MRC in production at some of the world’s largest AI training clusters, including OpenAI’s, for training frontier LLMs like ChatGPT and Codex.
Spectrum-X is also used in Microsoft’s Fairwater and Oracle Cloud Infrastructure’s (OCI) Abilene data center (a part of Project Stargate), two of the largest AI factories purpose-built for training and deploying frontier LLMs.

According to OpenAI, conventional network fabrics can take seconds or even tens of seconds to stabilize after failures. MRC instead load-balances traffic across all available paths and steers around overloaded paths in real time. This is intended to keep GPU utilization high while training runs through network slowdowns, congestion, failures, or other events that would ordinarily disrupt or stall the training process. Administrators also gain fine-grained visibility and control over traffic paths, monitoring network traffic from a single pane of glass.

OpenAI says MRC’s multi-plane network design can connect more than 100,000 GPUs using only two tiers of Ethernet switches, rather than the three or four tiers currently required by standard 800 Gb/s networks. The MRC specification was published today through the Open Compute Project along with an accompanying research paper.
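
To see why a multi-plane design can plausibly reach that scale with two switch tiers, a rough back-of-the-envelope sketch may help. It uses the textbook sizing rule for a non-blocking two-tier (leaf-spine) fabric: a switch with k ports can serve at most k × k / 2 endpoints. The 51.2 Tb/s switch capacity and the choice of eight planes are illustrative assumptions, not figures from the article or the MRC specification, and the function names are hypothetical.

```python
# Back-of-the-envelope sizing for a two-tier Ethernet fabric.
# ASSUMPTIONS (not from the article): each switch has 51.2 Tb/s of total
# capacity, and each 800 Gb/s NIC can be split evenly into `planes`
# slower links, one per independent network plane.

SWITCH_CAPACITY_GBPS = 51_200  # assumed switch capacity
NIC_SPEED_GBPS = 800           # per-GPU network speed cited in the article


def two_tier_capacity(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier (leaf-spine) fabric.

    Each leaf uses half its ports for endpoints and half for uplinks,
    and a spine with `radix` ports can reach at most `radix` leaves.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    endpoints_per_leaf = radix // 2
    max_leaves = radix
    return endpoints_per_leaf * max_leaves


def gpus_per_two_tier(planes: int) -> int:
    """GPUs reachable when each NIC is split across `planes` planes."""
    if planes < 1 or NIC_SPEED_GBPS % planes:
        raise ValueError("planes must evenly divide the NIC speed")
    link_speed = NIC_SPEED_GBPS // planes       # speed of each plane's link
    radix = SWITCH_CAPACITY_GBPS // link_speed  # ports per switch at that speed
    # Every GPU has one port in every plane, so the GPU count is the
    # endpoint count of a single plane.
    return two_tier_capacity(radix)


if __name__ == "__main__":
    for planes in (1, 2, 4, 8):
        print(f"{planes} plane(s): up to {gpus_per_two_tier(planes):,} GPUs")
```

Under these assumptions, a single-plane 800 Gb/s fabric tops out at 2,048 GPUs in two tiers, while splitting each NIC across eight 100 Gb/s planes raises the ceiling to 131,072, which is consistent with the "more than 100,000 GPUs" figure. Real deployments may size their fabrics differently.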
