Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibv3txbc5pnxsponfjvk65fi7lh42hwoqknqxladmb6xlrla43lwa",
    "uri": "at://did:plc:qzjwstutqk2cy7df7jbzd2hx/app.bsky.feed.post/3mm7gvolsroj2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreidqqpyurre6cfdy3ha2uxj527fq67h3xkbpv6m3rpb6cssslhdwy4"
    },
    "mimeType": "image/jpeg",
    "size": 5517037
  },
  "path": "/article/4172562/how-ai-is-transforming-network-incident-response-and-where-it-still-falls-short.html",
  "publishedAt": "2026-05-19T09:00:00.000Z",
  "site": "https://www.networkworld.com",
  "tags": [
    "Incident Response, Network Security, Security, Security Practices",
    "Broadcom’s 2026 State of Network Operations report",
    "route-sherlock",
    "Cloudflare 1.1.1.1 outage in July 2025"
  ],
  "textContent": "If you’ve sat through any vendor pitch in the last year, you’ve heard the promise. AI will detect the anomaly, correlate the signals, identify the root cause, maybe even remediate it. The autonomous network operations center is just around the corner.\n\nI’ve spent close to a decade building anomaly detection and telemetry systems at scale, and I think that promise is partly true, partly aspirational and partly misleading. The reality is messier. AI is genuinely helping in a few specific places, and it’s nowhere close to delivering in others, mostly for reasons that have nothing to do with model quality.\n\nThe biggest reason: we still can’t see most of what’s happening on our own networks.\n\n## The visibility problem comes first\n\nNetwork operators love to talk about observability. The actual state of observability in 2026 is much less impressive than the marketing suggests. According to Broadcom’s 2026 State of Network Operations report, 95% of IT professionals report lacking visibility into network segments, especially in public cloud. Only 49% believe their network can support the bandwidth and latency that AI workloads need.\n\nThat tracks with what I see in practice. At the scale I’ve worked, you typically have SNMP polling every few minutes, syslog arriving with variable delay and streaming telemetry on a fraction of your interfaces. Even with all of that running, the blind spots are enormous. Anything traversing a third-party network is invisible. Specific paths through ECMP fabrics are often uninstrumented. The state of a BGP decision process on a router three hops away usually requires SSH and a manual look.\n\nTraceroute, still one of the most-used diagnostic tools, has well-known limits. It can’t see through firewalls that block ICMP. It doesn’t handle asymmetric routing well. It gives you a snapshot of a path that may have already changed by the time you read the output. Load balancing distorts the result. 
Muted interfaces, meaning hops that are configured not to answer probes, look identical to real packet loss. None of this is news to anyone who has run a traceroute and squinted at the output.\n\nThe implication for AI is direct. AI can only reason about data it has. If 30% of the relevant paths during an incident are uninstrumented, no model is going to make up for that. Better algorithms operating on the same partial data give you marginally better partial answers.\n\n## Where AI is actually pulling its weight\n\nThat said, AI is delivering real value in network operations today, just narrower than the marketing implies. Three areas in particular.\n\n  * **Anomaly detection at scale.** This is the most concrete win. The basic idea is simple: instead of statically thresholding every counter (“alert when CRC errors > 100”), you compare each device against its own history and against its peers. If a router’s error rate has crept up, and the dozen other routers running the same model with the same role haven’t, that device gets flagged. Across hundreds of millions of data points a day on tens of thousands of devices, no human is going to spot that drift. This is where statistical methods earn their keep, and it’s the part of “AI in network ops” that has been quietly working for a while.\n  * **Alert correlation.** During a major incident, hundreds of alerts fire at once. Most are symptoms, not causes: an interface goes down and you get the interface alarm, the BGP session alarm, the BFD alarm, downstream prefix alarms, plus every monitoring system’s reachability alarm for things behind that interface. AI can group these by topology and timing and collapse hundreds of signals into a handful of clusters. It still doesn’t tell you the root cause, but it stops you from having to mentally filter the noise. 
The difference between staring at 300 alerts and looking at five clusters is the difference between a long war room and a focused one.\n  * **Pulling context together.** This is the newer area where LLMs are starting to be useful. During an investigation, an engineer typically bounces between several tools: route collectors, looking glasses, syslog viewers, monitoring dashboards, ticketing, the CLI of whatever device is misbehaving. Assembling the picture is slow and mostly copy-paste. I’ve been working on an open-source tool called route-sherlock that pulls together BGP routing data, looking glass output and historical context, and uses an LLM to write a plain-language summary of what might be going on. It doesn’t do the investigation. It compresses the first 30 minutes of context-gathering, so you start with a hypothesis instead of an empty terminal.\n\nNotice what these have in common. They all augment a human investigation. They speed up the parts that were already mechanical. None of them replace judgment.\n\n## What AI still can’t do\n\nThe harder boundary is causation in novel scenarios. Networks fail in ways that depend heavily on context. A BGP route leak looks different depending on whether it’s coming from a peer, a transit provider or an internal confederation. A fiber cut in one geography cascades differently than an identical cut elsewhere because of how traffic was pre-engineered. The same ECMP hash that worked for years can produce a microloop after a seemingly unrelated topology change.\n\nThe Cloudflare 1.1.1.1 outage in July 2025 is a useful example. For about an hour, Cloudflare’s public DNS resolver was unreachable. From the outside, it looked like a BGP hijack: another network appeared to be announcing Cloudflare’s prefixes. The actual cause was internal. 
A configuration error caused Cloudflare’s routes to disappear from the global routing table, and what looked like hijacking was an artifact of the withdrawal exposing the address space.\n\nA pattern-matching system would have flagged the BGP anomaly and probably labeled it a hijack. The label would have been wrong. The real story required understanding the operational intent behind a config change, and that kind of context simply isn’t in the telemetry.\n\nThis is the core problem. AI models learn from historical patterns. The most damaging incidents are usually the ones that don’t look like anything in the training data. When an experienced engineer troubleshoots, they’re reasoning about protocol behavior, vendor implementation quirks, traffic engineering intent and recent organizational changes (“we touched the peering policy last month, could that be related?”). That mental model of the network isn’t something current AI systems have.\n\n## Better data first, better models second\n\nThe current industry instinct is to throw bigger AI models at network operations. I think that’s the wrong order of operations. The bottleneck isn’t model sophistication, it’s coverage. A more sophisticated model analyzing 70% of the relevant data will still miss things rooted in the 30% you can’t see.\n\nThe investments that pay off are the unglamorous ones. Expanding streaming telemetry. Instrumenting cloud-to-cloud paths. Correlating network signals with application-layer data. Running continuous synthetic probes so you’re not waiting for a customer ticket to discover a blind spot. None of it is as exciting as a generative AI demo, but it’s the part that has to come first.\n\nAI’s role here is real and growing. It processes telemetry at scale, suppresses alert noise, surfaces outliers and pulls context together across tools. Those capabilities genuinely cut mean time to detect and mean time to resolve. But they augment the engineer’s investigation. 
They don’t replace it, and they won’t until we can actually see what’s happening on our networks end to end.\n\nIs AI transforming network incident response? Yes, but not the way the slides suggest. It’s not eliminating the 2 a.m. page. It’s making the hours after the page more focused. That’s a real improvement, and it’s worth investing in. The fully autonomous NOC remains a story for another decade, and it won’t arrive until the visibility gap closes.\n\n**This article is published as part of the Foundry Expert Contributor Network.**",
  "title": "How AI is transforming network incident response (and where it still falls short)"
}