Network outages, power failures strain data center resiliency

Network World [Unofficial] May 14, 2026

Data center outages are becoming less frequent overall, but resiliency gains are slowing as data center operators face mounting pressure from AI workloads, aging power infrastructure, and external dependencies, according to Uptime Institute’s newly released 2026 Annual Outage Analysis report.

The report marks the fifth consecutive year that outage frequency on a per-site basis has declined, continuing a long-term trend Uptime analysts attribute to improved operational maturity, distributed resiliency strategies, and infrastructure investments. But the report also suggests the industry may be reaching diminishing returns from traditional resiliency approaches as system complexity increases.

“We believe that over time, failures will increasingly not be the result of a single point of failure, but instead be linked to complex interactions between systems, including software, networks, and external dependencies. While site-based electrical and mechanical infrastructure remain a critical building block that needs to be resilient, digital infrastructure is becoming more distributed with outages originating outside the data center, including those tied to power availability, network connectivity, or the reliance on external cloud services playing a larger role,” said Andy Lawrence, founding member and executive director, Uptime Intelligence, in a statement.

According to Uptime’s survey data, half of operators reported experiencing an impactful outage within the past three years, down from 74% in 2020. However, about one in 10 respondents said their most recent outage was serious or severe. Uptime analysts said that stable outage rates do not necessarily indicate lower operational risk. As organizations run increasingly critical workloads across interconnected environments, even isolated incidents can trigger broader service disruptions across cloud, networking, and application infrastructure, according to Uptime.

“Digital infrastructure is remarkably resilient,” Lawrence said during a webinar discussing the findings. “But further resiliency gains are becoming harder to achieve.”

Power failures are the leading cause of data center outages

Power failures continue to dominate data center outage causes, accounting for 45% of impactful outages in Uptime’s latest survey data. While that figure declined from the previous year, it remains significantly higher than any other category.

Within power-related incidents, UPS failures, transfer switch failures, and generator failures are the leading root causes. Uptime analysts said growing grid instability, power constraints, and high-density compute deployments are creating new pressure points for operators already running closer to capacity limits.

“We are being asked to be bigger, cleaner, faster, smarter, and more resilient all at once,” Amber Villegas-Williamson, research analyst at Uptime Institute, said during the webinar.

The report also notes that shortages of critical infrastructure equipment, including transformers, generators, switchgear, and UPS systems, are forcing some operators to rely on substitute or secondhand components, which Uptime believes have contributed to several failures and incidents. In addition, Uptime said major data center fires have gradually increased in recent years, with lithium-ion batteries in UPS systems identified as a contributing factor in some incidents.

Figure: On-site power generation or distribution failures are the most common cause of serious outages, according to the Uptime Institute’s Global Data Center Survey. (Image credit: Uptime Institute)

Networking and connectivity outages remain a top operational risk

When Uptime expands the scope of its research and includes outages that occur outside of the data center, network weaknesses become a more frequent contributor to outages.

The firm’s Data Center Resiliency Survey takes a broader view, covering outage causes both inside and beyond the data center to track the most common causes of end-to-end IT service outages. From that perspective, network and connectivity issues are the most frequently reported cause of IT service-related outages, Uptime reports. Uptime researchers said increasingly distributed architectures and reliance on third-party infrastructure are making network resiliency just as important as facility resiliency.

According to Uptime’s 2026 resiliency survey, the most common causes of IT service-related outages are:

  • Networking/connectivity: 23%
  • Power: 21%
  • No IT services outages: 19%
  • IT system/software: 18%
  • Third-party IT services, including public cloud and SaaS: 10%
  • Cooling: 8%
  • Other: 2%
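
For readers who want to work with these figures, the shares can be tabulated as in the short Python sketch below. The variable names are illustrative and not from the report. The published shares add up to 101%, which is most likely a rounding effect; that explanation is an assumption, not something Uptime states.

```python
# Shares of IT service-related outage causes as quoted from Uptime's
# 2026 resiliency survey (percent of respondents).
outage_cause_shares = {
    "Networking/connectivity": 23,
    "Power": 21,
    "No IT services outages": 19,
    "IT system/software": 18,
    "Third-party IT services, including public cloud and SaaS": 10,
    "Cooling": 8,
    "Other": 2,
}

# "No IT services outages" is not a cause, so leave it out when ranking causes.
actual_causes = {
    cause: share
    for cause, share in outage_cause_shares.items()
    if cause != "No IT services outages"
}

# Most frequently reported cause among the real cause categories.
top_cause = max(actual_causes, key=actual_causes.get)

# Sum of all published shares; 101 rather than 100, presumably due to rounding
# (an assumption, since the report does not say).
total_share = sum(outage_cause_shares.values())

print(top_cause)    # Networking/connectivity
print(total_share)  # 101
```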

Uptime analysts said the growing use of software-based strategies, automated failover, and traffic rerouting has helped reduce some outages, but increasingly interconnected environments also make failures harder to isolate and contain. The report suggests that enterprise network teams may need to place greater emphasis on WAN resiliency, route redundancy, and cross-provider visibility as connectivity disruptions affect a growing number of critical services.

Digging deeper into the data, the top causes of network-related outages are configuration and change management failures, third-party network provider failures, and hardware failures.

Figure: Configuration and change management failures are the most common drivers of network-related outages, according to the Uptime Institute’s Data Center Resiliency Survey. (Image credit: Uptime Institute)

AI workloads are reshaping risk profiles

While the report stops short of directly attributing major outages to AI infrastructure, Uptime analysts warn that AI-driven density and power demands could introduce new operational risks over time.

High-density GPU clusters create highly variable power consumption patterns that can stress cooling systems, generators, and electrical infrastructure, analysts said during the webinar. Daniel Bizo, principal analyst at Uptime Intelligence, said synchronized power fluctuations across large AI training environments could create challenges during failover events if loads are not properly stabilized.

“If that load volatility is not dampened by some technique, capacitance, batteries or software, then the generators would be highly stressed,” Bizo said.

The report also highlights the growing operational complexity tied to software-based resiliency strategies. Distributed resiliency and automated failover can reduce the impact of single-site failures, but they also introduce synchronization challenges and software-layer vulnerabilities that may be harder to detect, according to Uptime.

Fiber failures and external dependencies gain prominence

One of the report’s strongest themes is that outage risks are increasingly originating outside the data center itself.

Uptime found that outages tied to fiber and connectivity incidents more than doubled compared with historical averages tracked between 2020 and 2025. “We have much more interconnected digital architectures,” Bizo said during the webinar. “If anything goes wrong, that can have cascading impacts.”

Over the past decade, Uptime has found that third-party providers, including cloud, telecommunications, and colocation companies, account for the majority of publicly reported outages. Telecommunications-related outages, in particular, rose significantly, from 29 in 2020 to 39 in 2025, reflecting increased exposure to weather events, accidental damage, and geopolitical instability across wide-area connectivity infrastructure.

The findings reinforce a growing challenge for enterprise networking teams: many critical outage risks now sit outside the controlled boundaries of the data center, according to Uptime.

Human error continues to drive outages

Human error is consistently a major factor behind data center and IT service outages.

Uptime found that 92% of operators said human error was at least a minor contributor to significant outages experienced over the past three years. According to the Uptime Institute Data Center Resiliency Survey 2026, respondents said human error contributed to outages in the following ways: major contributor (31%), moderate contributor (31%), minor contributor (30%), and not a contributor (8%).
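
The 92% figure is the sum of the three contributor categories. A quick check in Python, using only the percentages quoted above (the variable names are illustrative, not from the report):

```python
# Shares of respondents by how much human error contributed to significant
# outages, as quoted from the Uptime Institute Data Center Resiliency Survey 2026.
human_error_shares = {
    "major contributor": 31,
    "moderate contributor": 31,
    "minor contributor": 30,
    "not a contributor": 8,
}

# "At least a minor contributor" covers every category except "not a contributor".
at_least_minor = sum(
    share
    for label, share in human_error_shares.items()
    if label != "not a contributor"
)

# All four categories together should account for every respondent.
total = sum(human_error_shares.values())

print(at_least_minor)  # 92
print(total)           # 100
```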

Figure: Failure to follow established procedures is the leading driver of human error-related outages, according to the Uptime Institute’s Data Center Resiliency Survey. (Image credit: Uptime Institute)

Uptime analysts said operators should focus on operational discipline, simplified emergency procedures, and routine drills designed to simulate real-world outage scenarios.

“We talk about how we drill our staff. When you think about emergency services, how often they do preparation drills for events? When an event happens, they all know what they need to do, where they need to be, and what needs to happen,” Villegas-Williamson said. “That is the level of drilling instruction that site teams need to have, so that when an issue does occur, they’ve already been through this process.”
