How Cisco IT cut observability costs by 86% and eliminated major network outages

Network World [Unofficial] June 5, 2026

When several database clusters started failing simultaneously, Cisco IT had all the data it needed to diagnose the problem. The signals were there. Engineers saw them. The issue was that those signals were landing in separate systems that did not talk to each other, and the team had no way to correlate them in real time.

What followed was three hours of war-room calls across three separate bridges. Engineers were on one call, debating ownership of the problem. Application owners were on another, waiting on the database to recover. Executives were on a third, trying to explain to business partners why users could not place orders. The root cause was eventually found, but only after the outage had already disrupted ordering.

That incident became the impetus for a consolidation project that Anusha Nataraj, product manager in Cisco IT’s observability team, detailed in a session at Cisco Live.

The project has since reduced major incidents by 25% and produced zero major network outages over the last six quarters. The environment spans more than 1,500 applications, more than 71 of them externally facing, across more than 100,000 endpoints, processing more than 15,000 changes per month. The platform at the center of that consolidation is Splunk, which Cisco acquired in 2024. Cisco IT is now running its own product across its global infrastructure.

“We had the data, we had all the data, but [it’s] just that it was not stitched together, and we couldn’t see it all holistically,” Nataraj said.

The tool sprawl that made it worse

The pre-consolidation observability environment at Cisco IT was not a single gap. It was a collection of them. Logs were split across a partial Splunk deployment and Elastic instances. Metrics ran across Prometheus stacks, Grafana stacks and homegrown solutions. Event management ran on a separate homegrown platform. None of these systems fed into each other.

The team had considered staying with the existing mix, including Datadog and Elastic, and evaluated stitched-together open source alternatives. Three factors drove the decision against them: they could not scale to Cisco IT’s operational requirements, they lacked the AI capabilities the team needed, and they offered no roadmap that Cisco IT could shape as a customer.

“They worked at our department level, but they couldn’t scale to our IT needs, and they didn’t have the maturity of AI that we were expecting,” Nataraj said.

Nataraj was clear that the decision was not driven by the 2024 acquisition. The team evaluated Splunk against their requirements and selected it on fit, scale and AI roadmap.

The three-pillar consolidation

The consolidation followed a defined three-step sequence.

  • Log consolidation: All logs were moved to Splunk Cloud, retiring Elastic and other logging instances in the process.
  • Metrics consolidation: Currently in progress, with Prometheus, Grafana and homegrown stacks being retired as the work completes.
  • Business context via ITSI: The team is implementing IT Service Intelligence (ITSI) to add business context on top of the unified log and metrics data.

The 86% reduction in total cost of ownership for observability came out of that first phase. More than 400 on-premises servers were decommissioned along with their associated storage. Licenses across multiple platforms were consolidated. The contractor headcount assigned to monitoring those servers was reduced.

“We decommissioned a lot of servers that were on prem, which was more than 400 servers, and associated storage elements were all turned off, and that was a major savings for us,” Nataraj said.

What incident response looks like now

The operational change is most visible in how the team handles incidents. A video shown during the session walked through the current workflow.

When an alert fires in ITSI, one click launches a custom-built AI agent that queries logs, metrics, traces, topology data, and recent change requests in real time. The agent returns a plain-language summary of what broke, why it broke and how to fix it. Role-specific actions are included for DevOps, application and SRE teams. If escalation is needed, the agent drafts a handoff for the on-call engineer. The whole investigation happens in a single screen before an incident ticket is even created.
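
The session did not detail how the agent is built, so the following is only a minimal Python sketch of the correlation step it depends on: pulling events from several sources into one timeline around the alert. The function names (`gather_context`, `summarize`), the 15-minute window, and all sample events are hypothetical, and no Splunk or ITSI API is used. Cisco IT's agent reportedly uses AI to produce its plain-language summary; this sketch only formats the merged timeline.

```python
from datetime import datetime, timedelta


def gather_context(alert_time, sources, window=timedelta(minutes=15)):
    """Collect events from every source within `window` of the alert, sorted by time.

    `sources` maps a source name to a list of (timestamp, message) pairs.
    """
    start, end = alert_time - window, alert_time + window
    timeline = [
        (ts, name, message)
        for name, events in sources.items()
        for ts, message in events
        if start <= ts <= end
    ]
    return sorted(timeline)


def summarize(timeline):
    """Render the merged timeline as plain text, one event per line."""
    return "\n".join(f"{ts:%H:%M} [{name}] {msg}" for ts, name, msg in timeline)


# Hypothetical sample data standing in for logs, metrics, and change requests.
alert_time = datetime(2026, 1, 10, 11, 0)
sources = {
    "logs": [
        (datetime(2026, 1, 10, 9, 0), "nightly job finished"),
        (datetime(2026, 1, 10, 10, 58), "connection pool exhausted"),
    ],
    "metrics": [(datetime(2026, 1, 10, 10, 55), "db latency above threshold")],
    "changes": [(datetime(2026, 1, 10, 10, 50), "CHG-002 deployed to orders-db")],
}

timeline = gather_context(alert_time, sources)
print(summarize(timeline))
```

The point of the sketch is that the events become readable as one sequence only once they sit in a single place, which is the gap the three-bridge outage exposed.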

The result is a measurable shift in outcomes. When issues do occur, the three-bridge war room is gone. Teams can see where the problem is, and the response is contained to the people who need to act on it. “We have actually brought down our incident count by 25%, and in the last six quarters there have been no major network outages,” Nataraj said.

Lessons learned

Nataraj laid out a practical set of takeaways from the project for IT operations teams running at similar scale.

  • Unify data before applying AI. Without a unified data platform, AI has nothing reliable to work with. Getting all data into a single architecture has to come first.
  • Share visibility across teams. Correlating data is only useful if the teams who need it can access it. The team built cross-domain data sharing from the start.
  • Bring change and release data into observability. Tying change management records to observability data lets the team trace failures back to the specific change that caused them and maintain a rollback plan.
  • Treat cost savings as the budget for innovation. The TCO reduction funded the team’s shift away from routine monitoring. Engineers who were previously managing capacity and watching servers are now building AI agents on top of Splunk’s MCP tools, participating in alpha and beta testing for new Splunk tooling, and feeding product feedback directly to Cisco’s Splunk teams.
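
The third lesson, tying change records to observability data, can be illustrated with a short Python sketch. It assumes a simple heuristic that the talk did not describe: the most recent change to the alerting service within a lookback window is the first suspect. The `ChangeRecord` and `Alert` types, the `likely_cause` function, the two-hour window, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ChangeRecord:
    change_id: str
    service: str
    deployed_at: datetime


@dataclass
class Alert:
    service: str
    fired_at: datetime


def likely_cause(
    alert: Alert,
    changes: List[ChangeRecord],
    lookback: timedelta = timedelta(hours=2),
) -> Optional[ChangeRecord]:
    """Return the latest change to the alerting service inside the lookback window."""
    window_start = alert.fired_at - lookback
    candidates = [
        c
        for c in changes
        if c.service == alert.service
        and window_start <= c.deployed_at <= alert.fired_at
    ]
    return max(candidates, key=lambda c: c.deployed_at, default=None)


# Hypothetical change log and alert.
changes = [
    ChangeRecord("CHG-001", "orders-db", datetime(2026, 1, 10, 9, 0)),
    ChangeRecord("CHG-002", "orders-db", datetime(2026, 1, 10, 10, 30)),
    ChangeRecord("CHG-003", "billing-api", datetime(2026, 1, 10, 10, 45)),
]
alert = Alert("orders-db", datetime(2026, 1, 10, 11, 0))

suspect = likely_cause(alert, changes)
print(suspect.change_id if suspect else "no recent change found")
```

A match identifies a rollback candidate rather than a confirmed root cause; a real system would weigh more evidence than timing and service name.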

“They were purely ticket closers before,” Nataraj said. “They’re innovators, they wear product managers’ hats, and they are really happy about the work that they do.”

Job satisfaction, retention, and contractor reduction are all outcomes Nataraj cited as measurable ROI from the project. “Keeping the team motivated and having them feel happy is a real ROI for every single organization,” she said.
