{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib4vtoijp6zwqpyw5jkabeaynr7bdh2i3o7stqf6vkhtztna2p6eu",
    "uri": "at://did:plc:qzjwstutqk2cy7df7jbzd2hx/app.bsky.feed.post/3mide2mty2362"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreicy3rged5h7ub3nrx7mpgml2uunihxg7ssoy7voy6ob3lg4uxiqha"
    },
    "mimeType": "image/jpeg",
    "size": 3879651
  },
  "path": "/article/4151356/network-and-storage-patterns-for-ai-workloads-the-overlooked-bottleneck.html",
  "publishedAt": "2026-03-30T09:00:00.000Z",
  "site": "https://www.networkworld.com",
  "tags": [
    "Artificial Intelligence, Data Center, Enterprise Storage, Network-Attached Storage",
    "IBM",
    "MLCommons",
    "vLLM",
    "NVIDIA",
    "IBM’s AI storage",
    "The Kubernetes device plugin framework",
    "IBM AI Storage",
    "Kubernetes device plugins"
  ],
  "textContent": "I used to think AI performance was mostly a GPU problem.\n\nThen I watched a “healthy” GPU fleet crawl. Not because we ran out of compute, but because we ran out of movement. Tokens waiting on data. GPUs waiting on batches. Services waiting on east-to-west traffic. Storage queues quietly turning into tail latency.\n\nToday, I do not even call this a storage problem. It is an information supply chain problem. In real enterprise AI, data is scattered across on-prem, cloud and edge footprints. Training and inference cycles get longer. Expensive resources like GPUs stay scarce. And the system pays a time tax every time data has to hop, copy, translate or wait. IBM frames AI storage in this same “supply chain” reality, especially as organizations modernize for distributed data and AI at scale.\n\nIf you are running AI in production, especially LLM inference and retrieval augmented generation (RAG), the network and storage layer is where “it works” becomes “it works reliably at scale.”\n\nThis is my field guide to the patterns that matter, the metrics that expose bottlenecks quickly and the open-source tools that can help you fix them.\n\n## The metric shift: From averages to tail latency\n\nTraditional infrastructure teams love averages. 
AI punishes that mindset.\n\nFor LLM inference, user experience is governed by two numbers:\n\n  * **Time to first token (TTFT):** How long users wait before they see the first token.\n\n\n  * **Time per output token (TPOT):** How smoothly tokens stream after the first one.\n\n\n\nMLCommons uses TTFT and TPOT in its LLM inference benchmarking rules because they reflect what users feel, not what a mean value hides.\n\nOnce you track TTFT and TPOT in percentiles (p95 and p99), the network and storage layer stops being “someone else’s problem” and becomes an architectural priority.\n\n## Two traffic shapes, two different bottlenecks\n\nMost enterprise AI systems fall into two traffic shapes that break different things.\n\n### Shape 1: Training and batch analytics\n\n  * Big sequential reads and writes\n\n\n  * Dataset shuffles and checkpoints\n\n\n  * Distributed training traffic across nodes\n\n\n\nThis is bandwidth hungry. Parallelism and throughput matter. Latency is often less visible than in interactive workloads, but when training stretches from days to weeks, it is frequently a data path problem.\n\n### Shape 2: Inference and RAG\n\n  * Bursty request patterns\n\n\n  * Many small reads (vector search, metadata, prompt artifacts)\n\n\n  * High fan-out and fan-in across services\n\n\n  * Tail latency dominates\n\n\n\nMost CIO conversations I have are about inference, because that is where customer experience, employee productivity and revenue workflows live. That means the architecture should be optimized for consistency, not just peak throughput.\n\n## Three failure modes I see constantly\n\n### 1) GPUs look busy, but they are not productive\n\nI have seen GPU utilization in the 60 to 80 percent range while tokens per second stayed flat and queues kept growing. 
The system looked “loaded,” but it was not delivering more outcomes.\n\nIn practice, the fix is often not “more GPUs.” It is better batching and memory management in the serving layer, so GPUs spend more time generating tokens and less time context switching or waiting for fragmented work.\n\nServing engines like vLLM are useful here because they treat inference performance as a tunable discipline. You can tune batching, scheduling and memory behavior to balance throughput with TTFT and TPOT under real concurrency.\n\n**Pattern I rely on:** Separate the front door (API gateway, auth, rate limits) from the batching brain (LLM serving engine). Optimize for TTFT and TPOT, not just concurrency.\n\n### 2) East-to-west traffic quietly eats your latency budget\n\nRAG workloads are network hungry. A single prompt can trigger:\n\n  * embedding lookup\n\n\n  * vector search\n\n\n  * metadata fetch\n\n\n  * document chunk fetch\n\n\n  * rerank\n\n\n  * prompt assembly\n\n\n  * LLM call\n\n\n\nEven if each hop is “fast on average,” the p99 gets ugly under load because the pipeline is chatty and synchronous. The system starts to feel like the model is slow when the real issue is that your request spends too much time traveling.\n\n**Pattern I rely on:** Collapse hops where possible, co-locate latency sensitive services and treat network round trips as a scarce resource. A simple rule I use is this: Do not let your p99 depend on a long chain of synchronous calls.\n\n### 3) Storage becomes the hidden queue\n\nIn inference systems, storage rarely looks saturated at the device level. The problem is usually the data path: Too many copies, too much CPU involvement and too many small metadata operations that show up as tail latency.\n\nI like to explain the principle using GPUDirect Storage, even if you do not implement it. 
NVIDIA describes GPUDirect Storage as enabling a more direct DMA path between storage and GPU memory, reducing CPU overhead and latency by avoiding extra copies.\n\nYou do not need that exact technology to benefit from the lesson.\n\n**Pattern I rely on:** Make the data path boring. Fewer copies. Fewer layers. Fewer handoffs.\n\n## Unified data services beat siloed performance wins\n\nI have watched teams chase a 20 percent performance gain in one tier while ignoring the bigger issue: data fragmentation.\n\nIf your AI pipeline bounces across disconnected file, object and block systems, you keep paying the hop tax. You also increase the chance that “the right data” is not where the model expects it to be.\n\nIBM’s AI storage framing is helpful because it emphasizes unified storage approaches that consolidate file, block and object services while integrating with existing investments, to deliver data at scale with low latency.\n\nTranslated into an enterprise goal, this means fewer copies, fewer bridges and fewer places where tail latency can hide.\n\n## Content-aware storage and RAG: An underused lever\n\nHere is a point that does not get enough attention. RAG is not only about models and vector databases. It is also about whether your enterprise can make unstructured data retrievable without turning the data estate into a copy machine.\n\nIBM notes that very little enterprise data is used to train the large language models behind assistants, which limits business value, and it highlights “content-aware” approaches that extract semantic meaning from unstructured data so assistants can answer more intelligently.\n\nI like this framing because it shifts the conversation from “store more data” to “make data usable where it already lives.” That is often the difference between a RAG system that scales and one that becomes a governance and cost problem.\n\n## The metrics I track now (and why they work)\n\nWhen I am asked what to measure, I keep it simple. 
I want metrics that map to user experience and capacity decisions.\n\n### Inference experience\n\n  * TTFT p95 and p99\n\n\n  * TPOT p95 and p99\n\n\n  * Tokens per second per GPU\n\n\n  * Queue time before execution\n\n\n\nMLCommons is a good anchor here because TTFT and TPOT are benchmarked precisely to capture user-visible behavior.\n\n### Network health\n\n  * Service-to-service latency p95 and p99\n\n\n  * Retransmits and packet loss\n\n\n  * East to west throughput per node\n\n\n  * Queue depth in the network path during peak load\n\n\n\n### Storage health\n\n  * Read latency p95 and p99\n\n\n  * IOPS and bandwidth at the namespace or volume level\n\n\n  * Cache hit rates\n\n\n  * Metadata operation rate and latency (the sleeper issue)\n\n\n\n### System efficiency\n\n  * GPU active time vs waiting time\n\n\n  * CPU utilization and softirq time on serving nodes\n\n\n  * Fan-out per prompt and per request type\n\n\n\n## Two real-world use cases (with quantified outcomes)\n\nThese examples reflect patterns I have seen repeatedly. 
The numbers are representative and meant to show the shape of the problem, not promise identical results in every environment.\n\n### Use case 1: RAG assistant that “felt slow” even with plenty of GPU\n\n**Symptoms**\n\n  * TTFT p95 drifted from about 0.7s to about 2.2s during peak hours\n\n\n  * TPOT p95 stayed acceptable, but the first response felt delayed\n\n\n  * GPU utilization looked fine, but queue time rose steadily\n\n\n\n**Root cause**\n\n  * Vector search and chunk retrieval created bursty east to west traffic\n\n\n  * Too many synchronous hops and too little caching of hot content\n\n\n  * Network tail latency amplified fan-out\n\n\n\n**Fix pattern**\n\n  * Co-located vector search and document store for hot shards\n\n\n  * Cached top-k retrieved chunks and prompt templates\n\n\n  * Added asynchronous retrieval and progressive context loading for long documents\n\n\n\n**Outcome**\n\n  * TTFT p95 returned near baseline under similar user load\n\n\n  * Fewer p99 spikes because the pipeline depended on fewer synchronous calls\n\n\n  * Modest improvement in tokens per second because fewer requests stalled on I/O\n\n\n\n### Use case 2: Adding GPUs did not improve throughput\n\n**Symptoms**\n\n  * Tokens per second increased only about 10 percent after adding 25 percent more GPUs\n\n\n  * TPOT p99 worsened under concurrency\n\n\n  * CPU utilization spiked on serving nodes\n\n\n\n**Root cause**\n\n  * Serving layer batching and memory churn wasted GPU cycles\n\n\n  * Storage path added extra copies and CPU overhead for artifacts\n\n\n  * Scheduling placed workloads on nodes without the right NIC or storage locality\n\n\n\n**Fix pattern**\n\n  * Tuned the serving engine to match request size distribution and concurrency behavior (vLLM tuning is a good reference point for this type of work)\n\n\n  * Improved device-aware placement using Kubernetes device plugin patterns so specialized hardware is advertised cleanly to the scheduler\n\n\n  * Reduced CPU bounce 
buffering behavior in the data path where feasible\n\n\n\nThe Kubernetes device plugin framework is the simple building block behind making “specialized resources” schedulable at scale.\n\n**Outcome**\n\n  * More linear scaling as GPUs were added\n\n\n  * Stabilized TPOT p99 because fewer requests were blocked behind slow neighbors\n\n\n  * Reduced CPU overhead, freeing headroom for networking and observability\n\n\n\n## Open source that fits these patterns\n\nYou can implement most of these improvements using open-source components:\n\n  * **Observability:** Prometheus, Grafana, OpenTelemetry and eBPF-based tooling to see flow-level latency and fan-out.\n\n\n  * **Caching:** Redis for hot key/value caching; local NVMe caches for hot artifacts.\n\n\n  * **Serving:** vLLM for configurable batching and memory behavior under load.\n\n\n  * **Scheduling:** Kubernetes device plugins and resource-aware node pools for GPU and NIC locality.\n\n\n  * **Storage:** Ceph is a common open-source option for software-defined block, file and object patterns. IBM also calls out IBM AI Storage Ceph as an open source, software-defined approach aligned to these needs.\n\n\n\n## Limitations and tradeoffs\n\nEvery performance win has an operational cost. These are the tradeoffs I plan for.\n\n  1. Caching improves consistency, but invalidation is hard. Freshness, permissions and compliance requirements complicate “simple” caches.\n\n\n  2. Device-aware scheduling improves performance, but increases complexity. You introduce Kubernetes device plugins, operators and topology awareness. It is worth it, but it must be managed.\n\n\n  3. Reducing copies can improve latency, but raises platform constraints. Direct data paths reduce CPU overhead, but they come with configuration and compatibility requirements.\n\n\n  4. Unifying data services reduces silos, but consolidation needs governance. 
A unified approach can reduce hop tax, but only if access control, lifecycle policies and ownership are clear.\n\n\n\n## Future scope: What will matter more next\n\nOver the next 12 to 24 months, I expect four themes to grow:\n\n  * **AI SLOs become standard:** TTFT and TPOT become operational targets, not just benchmark terms.\n\n\n  * **Workload placement becomes policy-driven:** Placement logic becomes strategic, spanning hybrid footprints.\n\n\n  * **More GPU-centric data paths:** Fewer CPU copies and less context switching where possible.\n\n\n  * **RAG becomes “information supply chain” first:** Content-aware approaches and unified data services reduce re-copying and re-governing the same data.\n\n\n\n## What I would tell a CIO in an elevator pitch\n\nIf you want AI to feel fast and reliable, stop treating it like a model deployment and start treating it like a distributed system with strict tail latency expectations.\n\nMeasure TTFT and TPOT in percentiles. Map your pipeline fan-out. Make network and storage visible. Then apply disciplined patterns: Isolate lanes, cache aggressively, schedule intelligently, reduce copies in the data path and unify data services where it makes sense.\n\nYour GPUs will thank you, but more importantly, your users will.\n\n**This article is published as part of the Foundry Expert Contributor Network.**",
  "title": "Network and storage patterns for AI workloads: The overlooked bottleneck"
}