Beyond the fan: Crossing the liquid cooling Rubicon
The infrastructure inflection point
The artificial intelligence (AI) infrastructure revolution has made an unlikely discipline suddenly relevant: thermodynamics. My perspective draws on a mechanical engineering background in advanced heat transfer, reinforced by a decade of leading data center transformations across Europe and ongoing conversations with technology executives navigating these challenges today. The views expressed are personal — not the position of any organization.
The numbers tell the story. For the past decade, enterprise racks hummed along at 10 to 15 kilowatts (kW) each. Facilities teams knew how to manage them. The computer room air conditioning (CRAC) units worked. The hot-aisle containment — physical barriers that separate hot exhaust air from cold supply air to improve cooling efficiency — held. The math was familiar. Then AI training nodes arrived — and organizations faced 100 kW racks. NVIDIA’s Blackwell platform has accelerated this trajectory dramatically: the GB200 NVL72 system packs 72 graphics processing units (GPUs) into a single rack, drawing 120 to 130 kW, while the forthcoming GB300 NVL72 will push 135 to 140 kW per rack. This was not gradual evolution. It was a tenfold increase that rendered entire thermal architectures obsolete overnight.
One executive recently described the moment his facilities director ran the calculations. “We can fit exactly three of these racks before we exceed our cooling capacity,” the director said. Three racks. The organization had ordered forty-eight. A $15 million AI initiative was about to collide with a thermal wall that no software optimization could breach. Variations of this story now echo across the industry.
Most chief information officers (CIOs) treat the data center as a black box — something the facilities team handles while they focus on applications and strategy. That comfortable division of labor is ending as AI roadmaps collide not with talent shortages or budget constraints but with the inability to remove heat from silicon.
Executives ask when generative AI platforms will be operational, only to learn that timelines depend not on developers or data scientists but on whether teams can dissipate 4.8 megawatts (MW) of heat from rooms designed for 1.2 MW. The expressions tell the same story every time. This is not the constraint business leaders expected. The infrastructure inflection point is not coming — for those deploying high-density AI workloads, it has already arrived.
The physics of failure: Why air hits its limit at 20 kilowatts
Thermodynamics does not negotiate. Air cooling works through convection: the transfer of heat through the movement of fluids (in this case, air). Fans push cold air across heat sinks (metal structures with fins that increase surface area to dissipate heat) attached to processors. Heated air rises and gets captured by return ducts. Reliable and straightforward, until you scale it. The fundamental issue is thermal conductivity: the ability of a material to transfer heat. As IEEE Spectrum has documented, water conducts heat roughly 25 times more efficiently than air. To remove equivalent heat, air cooling demands far more airflow, and that airflow creates cascading problems.
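To put rough numbers on the airflow problem, the sketch below applies the sensible-heat balance Q = ρ · V · cp · ΔT and solves for the volumetric flow V. The fluid properties are approximate room-temperature values, and the 12 K temperature rise is an assumption chosen for illustration, not a figure from any specific facility. Note that this compares how much heat each fluid can carry per unit volume, which is a different property from the conductivity figure cited above.

```python
# Sensible-heat balance: Q = rho * V * cp * dT, solved for volumetric flow V.
# Property values are approximate and for illustration only.
AIR = {"density": 1.2, "cp": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"density": 997.0, "cp": 4186.0}  # kg/m^3, J/(kg*K)

M3S_TO_CFM = 2118.88   # cubic metres per second to cubic feet per minute
M3S_TO_LPM = 60_000.0  # cubic metres per second to litres per minute


def volumetric_flow_m3s(heat_w, fluid, delta_t_k):
    """Flow needed to carry heat_w watts with a coolant temperature rise of delta_t_k."""
    return heat_w / (fluid["density"] * fluid["cp"] * delta_t_k)


def describe(rack_kw, delta_t_k=12.0):
    """Return (air flow in CFM, water flow in L/min) for one rack. delta_t_k is assumed."""
    heat_w = rack_kw * 1000.0
    air_cfm = volumetric_flow_m3s(heat_w, AIR, delta_t_k) * M3S_TO_CFM
    water_lpm = volumetric_flow_m3s(heat_w, WATER, delta_t_k) * M3S_TO_LPM
    return air_cfm, water_lpm


for kw in (15, 20, 100, 130):
    cfm, lpm = describe(kw)
    print(f"{kw:>4} kW rack: {cfm:8.0f} CFM of air vs {lpm:6.1f} L/min of water")
```

Under these assumptions a 20 kW rack needs close to 3,000 CFM of air but only a few dozen litres of water per minute, which is why the move to liquid changes the plumbing far more than it changes the physics.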
At 20 kW per rack, the airflow velocity required to maintain safe operating temperatures triggers two failure modes. First, the acoustic vibration becomes severe enough to damage equipment. Organizations learn this lesson the hard way when high-frequency vibration from upgraded CRAC units causes bit errors in high-density Non-Volatile Memory Express (NVMe) storage arrays. The signature is mechanical resonance in drive enclosures. Fans shake storage infrastructure to death.
Second, the power required for that airflow becomes self-defeating. At 100 kW densities, nearly 30 percent of the total facility power goes to fans alone — before accounting for compressors and chillers working overtime to cool the air. According to Uptime Institute research, data centers spend an estimated $1.9 to $2.8 million per MW annually on operations, with cooling-related costs consuming nearly $500,000 of that figure. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) TC 9.9 guidelines governing data center thermal management were written for a 15 kW world. Many organizations now operate so far outside those parameters that the guidelines have become irrelevant.
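One reason fan power becomes self-defeating is the fan affinity laws: for a given fan and air density, flow scales with rotational speed while shaft power scales with the cube of speed. The sketch below is an idealized illustration of that relationship. Real installations deviate from it because of duct losses, motor efficiency, and fan selection, so treat the output as a trend rather than a prediction.

```python
def fan_power_ratio(flow_ratio):
    """Idealized fan affinity law: power scales with the cube of the flow ratio."""
    return flow_ratio ** 3


for flow_ratio in (1.0, 1.5, 2.0, 3.0):
    print(f"{flow_ratio:.1f}x airflow -> {fan_power_ratio(flow_ratio):.1f}x fan power")
```

Doubling airflow through the same fans costs roughly eight times the fan power in this idealized model, which is the mechanism behind the overhead described above.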
One moment crystallized this reality. A single CRAC unit failed in a training cluster. Within eight minutes, hot-aisle temperatures exceeded 120°F. Monitoring systems triggered automatic throttling on millions of dollars of compute infrastructure. A multi-day processing run crashed and restarted from a checkpoint. For anyone standing in that sweltering aisle watching temperature readouts climb, the conclusion was inescapable: air had carried the industry as far as it could go.
Crossing the Rubicon: Cold plates versus rear-door heat exchangers
Bringing liquid into a data center is terrifying. Water — or water-adjacent fluids — enters rooms filled with equipment worth tens of millions of dollars. Equipment that fails catastrophically when wet. “Crossing the Rubicon” captures the commitment: once started down this path, there is no returning to the comfortable certainty of air cooling.
The two primary architectures organizations evaluate are direct-to-chip (DTC) cold plates and rear-door heat exchangers (RDHx). Understanding both matters because the most successful implementations deploy a hybrid approach.
Cold plate systems pump coolant directly through metal plates, making physical contact with processors. The engineering elegance is remarkable. Instead of moving heat through air to a distant cooling system, heat conducts directly into liquid flowing inches from silicon. The most effective implementations use a secondary fluid distribution loop with a coolant distribution unit (CDU) at each row. The CDU receives chilled water from the central plant and uses heat exchangers to cool the secondary loop that touches servers. This architecture can handle the 1,000-watt-plus thermal design power (TDP) — the maximum heat a processor generates under load — of individual Blackwell GPUs. These are thermal loads that would require hurricane-force airflow to dissipate through convection alone.
Fluid chemistry requires more attention than most teams anticipate. Deionized water seems the obvious choice: maximum thermal conductivity and zero mineral deposits. But deionized water is aggressive. It wants to ionize and will corrode aluminum and copper to achieve equilibrium. A PG25 mixture — 25 percent propylene glycol in deionized water, such as Dow’s DOWFROST LC — represents the right trade-off. The glycol provides corrosion inhibition and freeze protection for loop segments passing through unconditioned spaces. The thermal performance penalty relative to pure water is roughly 5 percent, worth accepting for corrosion protection.
RDHx units solve a different problem. Even with cold plates removing 80 percent of heat directly from processors, voltage regulator modules (VRMs) — the circuitry that converts and regulates power delivery to processors — and memory still generate significant thermal load. Traditionally, that heat enters the hot aisle. RDHx units mount to each rack’s rear and capture exhaust heat before it reaches the room — the cleanup crew handling thermal energy cold plates cannot reach.
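The hybrid split described above can be sketched as a simple heat budget. The example takes a 130 kW rack, assigns 80 percent of the heat to the cold-plate loop and the remainder to the rear door, and estimates the coolant flow the liquid loop would need. The PG25 property values and the 10 K loop temperature rise are rough assumptions for illustration; a real design would use the coolant supplier's data sheet and the CDU vendor's specifications.

```python
from dataclasses import dataclass


@dataclass
class Coolant:
    name: str
    density_kg_m3: float
    cp_j_kg_k: float


# Approximate, illustrative properties. PG25 values are assumptions, not data-sheet figures.
WATER = Coolant("water", 997.0, 4186.0)
PG25 = Coolant("PG25 (assumed properties)", 1020.0, 3900.0)


def split_rack_heat(rack_kw, cold_plate_fraction=0.80):
    """Return (heat to the cold-plate loop, heat left for the rear-door exchanger) in kW."""
    liquid_kw = rack_kw * cold_plate_fraction
    return liquid_kw, rack_kw - liquid_kw


def loop_flow_lpm(heat_kw, coolant, delta_t_k=10.0):
    """Coolant flow in litres per minute to carry heat_kw at an assumed temperature rise."""
    m3_per_s = heat_kw * 1000.0 / (coolant.density_kg_m3 * coolant.cp_j_kg_k * delta_t_k)
    return m3_per_s * 60_000.0


liquid_kw, air_kw = split_rack_heat(130.0)
print(f"Cold plates: {liquid_kw:.0f} kW, rear-door exchanger: {air_kw:.0f} kW")
for coolant in (WATER, PG25):
    print(f"{coolant.name}: {loop_flow_lpm(liquid_kw, coolant):.0f} L/min")
```

Even at an 80 percent capture rate, the residual load on the rear door in this example is larger than an entire legacy rack, which is why the cleanup stage cannot be treated as an afterthought.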
A colleague recently described his organization’s first liquid cooling deployment. The rack held eight high-density GPUs — hundreds of thousands of dollars of silicon, not counting chassis and networking. A technician connected quick-disconnect fluid couplings. The manifold pressurized. Everyone held their breath. Every leak scenario played through their minds. The team had implemented zone-based leak detection with rope sensors along every fluid path and drip trays under every potential failure point. Prevention systems only matter until they do not.
The connection held. Coolant began flowing. Within minutes, processor temperatures dropped thirty degrees while fans fell to barely audible levels. Terror converted into operational capability. That conversion required months of planning and leak-detection infrastructure rivaling network monitoring — but it worked. The efficiency gains are substantial: NVIDIA reports that hyperscale facilities deploying liquid-cooled GB200 systems achieve up to 25 times the energy efficiency of air-cooled architectures, translating to over $4 million in annual savings for a 50 MW data center. This story, with minor variations, now repeats across the industry.
The RoCE revolution: Tuning the fabric for East-West traffic
Solving the thermal problem reveals another constraint many teams do not appreciate until they hit it: network architecture. Traditional data center networks handle north-south traffic — data flowing between external clients and internal servers that cross the network perimeter. AI training workloads generate massive east-west traffic — data moving laterally between servers within the data center as GPUs synchronize gradients and share model state. The patterns differ fundamentally, and most networks are not ready.
The choice between InfiniBand, a high-speed interconnect technology designed specifically for low-latency, high-bandwidth computing, and Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) consumes weeks of analysis for every organization tackling this challenge. InfiniBand remains the gold standard for latency-sensitive high-performance computing (HPC) workloads. NVIDIA’s networking division will happily sell complete InfiniBand fabrics. But InfiniBand requires specialized expertise that most teams lack. It is a parallel universe to Ethernet, the ubiquitous networking standard that connects most of the world’s computers and that teams have spent decades mastering. RoCE v2 offers a path to RDMA performance while leveraging existing Ethernet skills and infrastructure; in my experience, it is often the right choice for enterprise environments.
Decision Framework: InfiniBand vs. RoCE v2
| Factor | InfiniBand | RoCE v2 |
|---|---|---|
| Latency | Sub-microsecond (gold standard) | Low microseconds (adequate for most) |
| Team Expertise | Requires specialized HPC skills | Leverages existing Ethernet expertise |
| Infrastructure | Dedicated fabric required | Converged with existing Ethernet |
| Best For | Hyperscale, dedicated AI clusters | Enterprise, mixed workloads |
The technical implementation requires rebuilding the understanding of quality of service (QoS) — network mechanisms that prioritize certain traffic types over others. Traditional Ethernet is lossy by design: packets drop, and Transmission Control Protocol (TCP) handles retransmissions. RDMA does not work that way. A single dropped packet can invalidate an entire memory transfer and force a retry cascading across the training cluster. Creating a lossless fabric on Ethernet requires Priority Flow Control (PFC) — defined in the IEEE 802.1Qbb standard — and that is where real complexity begins.
PFC tells upstream switches to stop transmitting when downstream buffers fill. Done poorly, this creates head-of-line blocking where a congested flow stalls unrelated traffic. Done very poorly, it creates PFC storms in which pause frames propagate across the entire fabric, and nothing moves. Tuning PFC thresholds demands precision — pausing congested flows before packets drop, but releasing them before blocking propagates.
The breakthrough comes when teams properly configure Explicit Congestion Notification (ECN), specified in RFC 3168. ECN marks packets when queue depths exceed configurable thresholds, signaling sources to reduce transmission rates. Setting ECN marking thresholds below PFC trigger points creates a graduated response. Light congestion triggers ECN, and flows slow voluntarily. Only severe congestion triggers the PFC pause. The result: a fabric that breathes — expanding and contracting with workload demands rather than oscillating between full speed and complete stop.
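The graduated response can be expressed as a small decision rule. The sketch below is a conceptual model only: the class and function names are hypothetical, the percentages are placeholders, and real switches typically express these limits in bytes or cells per queue, often with minimum and maximum marking thresholds and a marking probability between them. The one invariant it enforces is the ordering the paragraph above depends on: ECN marking must begin below the PFC pause point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueThresholds:
    ecn_mark_pct: float   # queue depth at which packets start being ECN-marked
    pfc_pause_pct: float  # queue depth at which a PFC pause frame is sent upstream

    def __post_init__(self):
        # Reject configurations where the pause would fire before senders are warned.
        if not 0 < self.ecn_mark_pct < self.pfc_pause_pct <= 100:
            raise ValueError("ECN marking must start below the PFC pause point")


def congestion_response(queue_depth_pct, thresholds):
    """Return the action a queue takes at a given depth (percent of buffer)."""
    if queue_depth_pct >= thresholds.pfc_pause_pct:
        return "pfc_pause"   # severe congestion: stop the upstream sender
    if queue_depth_pct >= thresholds.ecn_mark_pct:
        return "ecn_mark"    # light congestion: ask sources to slow down
    return "forward"


thresholds = QueueThresholds(ecn_mark_pct=10, pfc_pause_pct=80)  # placeholder values
for depth in (5, 25, 85):
    print(f"queue at {depth}% -> {congestion_response(depth, thresholds)}")
```

Validating the ordering at configuration time is a cheap guard: a fabric where the pause threshold sits at or below the marking threshold never gets the chance to slow down voluntarily.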
The pattern is consistent across organizations attempting this transition. Validation attempts running 175-billion-parameter model synchronizations across entire GPU clusters collapse on the first try — PFC storms freeze entire fabrics. Every link shows 100% utilization, while actual throughput drops to near zero. Engineers crowd into network operations centers, watching packet captures that look like chaos. Traffic is not flowing; it is thrashing.
Diagnosis typically takes days. Default ECN marking thresholds, often configured at 50 percent queue depth, prove far too high for RDMA traffic patterns: marking starts too late. By the time ECN signals congestion, buffers are already filling fast enough to trigger PFC. Once PFC triggers, back-pressure propagates faster than ECN signals can throttle sources. Teams chase runaway reactions.
After tuning ECN thresholds to mark at 10 percent queue depth rather than 50 percent, synchronizations complete smoothly with sustained throughput of 380 gigabits per second. The difference between success and catastrophic failure is a single parameter change requiring a deep understanding of traffic flow through the topology. The lesson is clear: technology leaders must personally understand network QoS configuration before deploying any significant AI workload.
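Why the marking point matters so much can be seen in a deliberately simple toy model. The sketch below simulates one switch queue fed by a bursty sender that halves its rate when it sees an ECN mark, with the mark arriving a couple of ticks late. Every number in it (buffer size, rates, feedback delay, the 80 percent pause point) is a made-up assumption, the rate control is a crude stand-in for real congestion-control schemes, and the model only counts the ticks spent at or above the pause threshold rather than simulating the pause itself. It is meant to illustrate the timing argument for the 50 percent and 10 percent marking points described above, not to reproduce any measured result.

```python
from collections import deque


def simulate(ecn_threshold, pfc_threshold=80, buffer_size=100, drain=10,
             start_rate=40, feedback_delay=2, ticks=30):
    """Toy single-queue model. Returns (ticks at or above the PFC threshold, peak depth)."""
    queue = 0
    rate = start_rate
    marks = deque([False] * feedback_delay)  # marks still in flight back to the sender
    pfc_ticks = 0
    peak = 0
    for _ in range(ticks):
        # The sender reacts to the mark generated feedback_delay ticks ago.
        if marks.popleft():
            rate = max(1, rate // 2)          # multiplicative decrease on a mark
        else:
            rate = min(start_rate, rate + 1)  # gentle additive increase otherwise
        queue = max(0, min(buffer_size, queue + rate - drain))
        peak = max(peak, queue)
        marks.append(queue >= ecn_threshold)
        if queue >= pfc_threshold:
            pfc_ticks += 1
    return pfc_ticks, peak


for ecn in (50, 10):
    pfc_ticks, peak = simulate(ecn_threshold=ecn)
    print(f"ECN at {ecn}%: ticks at PFC threshold = {pfc_ticks}, peak queue = {peak}%")
```

In this toy setup the late marking point lets the queue overshoot into the pause region before the sender has heard anything, while the early marking point gives the feedback loop enough time to act. The specific numbers are artifacts of the assumptions; the ordering of events is the point.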
Becoming grid-interactive: BESS and the new power calculus
The final piece of infrastructure transformation addresses power delivery. AI workloads draw inconsistent power. A training cluster — hardware running the computationally intensive process of teaching AI models — idles at perhaps 20 percent of peak draw, then spikes to full capacity when computation begins. These step loads — sudden, large changes in power demand — stress the electrical infrastructure in ways traditional enterprise computing never approached.
Utility providers communicate clearly: organizations cannot simply request 50 MW of additional capacity and expect it to appear. The grid has constraints. Transformers and substations require years to upgrade. Meanwhile, AI roadmaps demand capacity unavailable from the grid on business timelines.
Battery Energy Storage Systems (BESS) provide the bridge. Megawatt-class BESS installations serve two functions. First, they handle step loads by supplementing grid power during the seconds it takes the utility supply to ramp. When training clusters transition from idle to full load, the instantaneous power draw would otherwise cause a voltage sag across facilities. BESS responds in milliseconds — not seconds — smoothing jarring demand spikes with sub-second precision that generators cannot match.
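A rough sizing sketch makes the step-load role concrete. It assumes the cluster idles at 20 percent of peak, as described earlier, and that the upstream supply ramps linearly to meet the new demand, so the battery covers the triangular gap between the step and the ramp. The 4.8 MW peak matches the heat load discussed earlier in this article; the 10-second ramp is a hypothetical figure, and the function names are illustrative.

```python
def step_load_mw(peak_mw, idle_fraction=0.20):
    """Size of the jump when a cluster goes from idle to full load."""
    return peak_mw * (1.0 - idle_fraction)


def bridge_energy_kwh(step_mw, ramp_seconds):
    """Energy the battery supplies while the upstream supply ramps linearly (assumed)."""
    # Area of the triangle between the step demand and the linear ramp.
    return 0.5 * step_mw * 1000.0 * ramp_seconds / 3600.0


step = step_load_mw(4.8)
energy = bridge_energy_kwh(step, ramp_seconds=10.0)  # hypothetical ramp time
print(f"Step load: {step:.2f} MW, bridging energy: {energy:.1f} kWh")
```

Under these assumptions the bridging energy is small while the instantaneous power is large, which suggests that for step loads the inverter power rating and response time matter more than battery capacity.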
Second, BESS enables demand response — the practice of adjusting power consumption based on grid conditions and pricing signals. Organizations draw from batteries during peak pricing periods and recharge during off-peak hours. During summer afternoons when grid demand and prices peak, four-hour duration BESS installations carry meaningful portions of AI workloads. Peak shaving — reducing consumption during high-cost periods — alone can reduce annual energy costs by 20 to 30 percent, with payback periods under three years in high-rate markets. The economic benefit matters but remains secondary to capability — BESS lets organizations deploy AI workloads without waiting for grid upgrades, creating a power buffer that decouples operational timelines from utility infrastructure schedules.
This strategic evolution requires new metrics. Power Usage Effectiveness (PUE) measures total facility power divided by IT equipment power. A PUE of 1.4 means 40 percent overhead for cooling and infrastructure. But PUE reveals nothing about whether power is productive. A more useful metric is Power Compute Effectiveness (PCE) — useful AI operations per kilowatt-hour. PCE lets technology leaders explain to boards not just efficiency, but that power consumption translates into intelligence at a measurable rate. The Green Grid has published extensive research on these evolving data center efficiency metrics.
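Both metrics reduce to simple ratios, sketched below. The PUE formula follows the definition given above. The PCE function is one possible reading of "useful AI operations per kilowatt-hour": it assumes the denominator is total facility energy and leaves the unit of "useful operations" (tokens served, training steps completed, or similar) to the organization, so the sample figures are purely hypothetical.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy over IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_pct(pue_value):
    """Cooling and infrastructure overhead as a percentage of IT energy."""
    return (pue_value - 1.0) * 100.0


def pce(useful_ai_operations, total_facility_kwh):
    """Power Compute Effectiveness, assuming total facility energy as the denominator."""
    if total_facility_kwh <= 0:
        raise ValueError("Facility energy must be positive")
    return useful_ai_operations / total_facility_kwh


facility_pue = pue(total_facility_kwh=1400.0, it_equipment_kwh=1000.0)
print(f"PUE {facility_pue:.2f}, overhead {overhead_pct(facility_pue):.0f}%")
print(f"PCE {pce(2.8e9, 1400.0):,.0f} operations per kWh (hypothetical workload)")
```

Two facilities with the same PUE can differ widely on PCE if one spends its IT energy on stalled or throttled jobs, which is the gap the second metric is meant to expose.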
The software layer coordinating all this requires significant development. Automated workload scheduling that considers power pricing, grid carbon intensity — the amount of CO₂ emitted per unit of electricity — and thermal headroom in real time represents the state of the art. Non-urgent training jobs queue for periods when power is cheap and clean. Urgent inference workloads — production AI systems responding to real-time requests — run immediately regardless of cost. The system treats electricity as a resource to optimize alongside compute and storage.
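The scheduling policy described above can be sketched as a single admission decision. Everything here is hypothetical: the class names, the price and carbon limits, and the sample readings are placeholders, and a production scheduler would draw on live utility, telemetry, and orchestration data. One design choice is an assumption of mine rather than something stated above: thermal headroom is treated as a hard limit even for urgent jobs, on the reasoning that cost can be overridden but cooling capacity cannot.

```python
from dataclasses import dataclass


@dataclass
class GridSnapshot:
    price_per_mwh: float        # current energy price
    carbon_g_per_kwh: float     # grid carbon intensity
    thermal_headroom_kw: float  # spare cooling capacity in the target hall


@dataclass
class Job:
    name: str
    urgent: bool
    power_kw: float


def should_run_now(job, grid, max_price=60.0, max_carbon=250.0):
    """Decide whether to start a job now. Limits are placeholder values."""
    if job.power_kw > grid.thermal_headroom_kw:
        return False  # assumed hard limit: never exceed available cooling
    if job.urgent:
        return True   # urgent inference runs regardless of price or carbon
    return grid.price_per_mwh <= max_price and grid.carbon_g_per_kwh <= max_carbon


off_peak = GridSnapshot(price_per_mwh=40.0, carbon_g_per_kwh=180.0, thermal_headroom_kw=500.0)
peak = GridSnapshot(price_per_mwh=140.0, carbon_g_per_kwh=420.0, thermal_headroom_kw=500.0)
jobs = [Job("nightly-finetune", urgent=False, power_kw=300.0),
        Job("chat-inference", urgent=True, power_kw=120.0)]

for label, grid in (("off-peak", off_peak), ("peak", peak)):
    for job in jobs:
        action = "run" if should_run_now(job, grid) else "defer"
        print(f"{label:>8}: {job.name} -> {action}")
```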
The 12-month roadmap: From assessment to operations
For technology leaders beginning the liquid cooling journey, the following timeline provides a realistic framework for moving from assessment to operational capability.
| Months | Phase | Key Activities |
|---|---|---|
| 1–2 | Assessment | Thermal audit of existing facilities; power capacity analysis; workload density projections; vendor evaluation for CDUs and cold plates; network QoS baseline assessment |
| 3–4 | Design | Architecture selection (DTC, RDHx, or hybrid); fluid loop design; leak detection system specification; network fabric design for RoCE/InfiniBand; BESS sizing and placement |
| 5–7 | Procurement | Long-lead equipment orders (CDUs: 12–16 weeks; switchgear: 20+ weeks); contractor selection for mechanical/electrical work; PG25 coolant sourcing; leak detection infrastructure |
| 8–10 | Installation | Physical infrastructure deployment; piping and manifold installation; CDU commissioning; network fabric buildout; ECN/PFC threshold configuration; BESS integration |
| 11–12 | Validation | Thermal stress testing; network performance validation under load; leak detection verification; runbook development; operations team training; production cutover |
The new mandate for the technical executive
The journey from “infrastructure is someone else’s problem” to understanding fluid loops and ECN thresholds represents a fundamental shift in technology leadership. Technology executives are not service providers ordering capabilities from vendors. They are energy and thermal architects whose decisions about physics directly enable or constrain AI strategy.
The executives who succeed in deploying AI at scale will be those who stop delegating the physical layer and start owning it. No one builds an AI-forward organization on infrastructure they do not understand. The thermal wall is real. The network complexity is real. The power constraints are real. Solving them requires technical leaders willing to get into the weeds — leaders who recognize that the most strategic decisions in AI may involve coolant chemistry and switch buffer depths rather than algorithms and models.
Lessons from sweltering hot aisles — whether experienced directly or heard from peers navigating the same challenges — teach more about AI infrastructure than any analyst report or vendor presentation. The insight is simple: in the age of AI, the data center is not a black box to be managed. It is the foundation that makes everything else possible. Technology leaders who understand this will shape the future. Those who do not will watch their AI strategies collide with physics — and physics always wins.
**This article is published as part of the Foundry Expert Contributor Network.**