Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig7hvyghr7pm5pfy5fo2xx6tpk5u7ern6wdipkohmzear5qvdikea",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnizbaq25wl2"
  },
  "path": "/t/datasets-streaming-bottlenecks-storage-client-memory-gc-pressure-vs-sharding-cache-request-amplification/176485#post_2",
  "publishedAt": "2026-06-04T23:56:36.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Streaming datasets: 100x More Efficient",
    "Datasets streaming docs",
    "Dataset vs IterableDataset",
    "IterableDataset / main classes docs",
    "HF Datasets streaming docs",
    "HF Hub dataset streaming guide",
    "WebDataset multinode docs",
    "MosaicML StreamingDataset",
    "StreamingDataset API docs",
    "HF Datasets main classes / IterableDataset docs",
    "Issue: IterableDataset distributed/DataLoader sharding performance traps",
    "Issue: streaming datasets doesn’t work properly with multi-node",
    "Issue: Problem in training iterable dataset",
    "Forum: Loading webdatasets across multiple nodes",
    "Loading webdatasets across multiple nodes",
    "Forum: How to use split_dataset_by_node and shuffle on IterableDataset",
    "Forum: Keeping IterableDataset node-wise split fixed during DDP",
    "PyTorch DataLoader docs",
    "TorchData Stateful DataLoader docs",
    "PyTorch issue: pin_memory / prefetch / persistent_workers memory behavior",
    "Forum: A streaming dataset’s memory footprint continually grows",
    "Issue: Memory leak when streaming",
    "huggingface_hub environment variables",
    "huggingface_hub cache management",
    "Datasets cache management",
    "Issue: HF_DATASETS_CACHE ignored?",
    "HF upload guide: local Xet cache on NVMe/SSD",
    "Environment variables: HF_XET_CACHE",
    "Hub local cache docs",
    "Using Xet storage",
    "HF upload guide",
    "s3fs issue: large file read, block size, cache type",
    "fsspec documentation",
    "s3fs documentation",
    "Datasets main classes / IterableDataset reference",
    "Hub local cache",
    "Upload files to the Hub / Xet cache guidance",
    "Forum: Speeding up Streaming of Large Datasets / FineWeb",
    "Issue: More easily support streaming local files",
    "Issue: Make Dataset streaming queries retryable",
    "WebDataset GitHub",
    "WebDataset PyPI notes",
    "MosaicML StreamingDataset GitHub",
    "MosaicML StreamingDataset API docs",
    "Databricks/MosaicML blog: StreamingDataset",
    "fsspec docs",
    "s3fs docs"
  ],
  "textContent": "Hmm… This isn’t based on personal experience, but since it doesn’t seem like a simple problem, it looks like it would be best to sort things out first. This looks like a pretty layered problem:\n\n* * *\n\n## Direct answer\n\nYes — I think this is a real pain point, but I would not frame it as one single bottleneck or as one generic “storage client problem”.\n\nFor large-scale training, the bottleneck can absolutely come from the data access layer rather than raw network bandwidth. But in practice, “data access layer” usually means several interacting pieces:\n\n\n    dataset layout\n      -> sharding / rank-node-worker assignment\n      -> DataLoader prefetch / multiprocessing / pinned memory\n      -> Hub/Xet or S3/GCS/Azure/fsspec/FUSE transport\n      -> local cache placement: NVMe/SSD vs NFS/shared filesystem\n      -> training topology: world_size, nodes, workers, epochs\n\n\nSo my short answer would be:\n\nQuestion | Practical answer\n---|---\nIs this a real pain point? | Yes, especially for multi-worker or multi-node training, large Parquet/WebDataset-style corpora, object storage, and shared filesystems.\nIs it just raw bandwidth? | Usually no. Request patterns, shard layout, cache locality, DataLoader behavior, and memory pressure can dominate.\nIs request amplification still the main issue? | Not always. HF has already improved the Hub streaming path significantly, including data-file resolution, request reduction, Parquet prefetching, and buffering. See Streaming datasets: 100x More Efficient.\nIs memory/RSS growth real? | There are reports of it, but I would not immediately blame only `datasets`. PyTorch DataLoader workers, prefetching, pinned memory, transforms, decoding, Arrow/Parquet readers, and fsspec caches can all contribute.\nIs sharding important? | Very important. Bad alignment between shard count and `world_size` can cause skipped samples, duplicated reads, poor locality, or uneven work.\nWould a Rust streaming/cache layer help? | Maybe, but only if it solves topology-aware shard scheduling, bounded prefetch/memory, local cache locality, and observability. “Rust” alone is not the product; the useful thing would be a clearer data-plane contract.\nWhat should users do today? | Use topology-friendly shards, split by node/rank explicitly, keep cache on local NVMe/SSD, tune DataLoader conservatively, and materialize heavy preprocessing when needed.\n\nSo I would describe the minimum useful integration not as “a faster downloader”, but as something closer to:\n\n> a topology-aware streaming data plane for training.\n\nThat would include:\n\n  * shard planning aware of `world_size`, ranks, nodes, and DataLoader workers\n  * node-local cache locality across epochs\n  * bounded-memory prefetching\n  * clear interaction with PyTorch DataLoader\n  * support for Hub/Xet and object-storage-like backends\n  * diagnostics for skipped samples, duplicate reads, cache hits, remote bytes, request count, worker RSS, and per-rank throughput\n\n\n\nWithout those pieces, a lower-level cache or storage client might improve one workload while leaving the main distributed-training failure mode unresolved.\n\n* * *\n\n## 1. Why this is not just “streaming is slow”\n\nFor small or exploratory use, `streaming=True` is often exactly the right tool: it lets you iterate over large datasets without downloading everything first. The official docs describe dataset streaming as a way to work with a dataset as you iterate over it, instead of fully downloading it first:\n\n  * Datasets streaming docs\n  * Dataset vs IterableDataset\n\n\n\nBut for large-scale training, especially with many DataLoader workers, multiple ranks, multiple nodes, remote object storage, or shared filesystems, the bottleneck is often not simply “network bandwidth”.\n\nA useful first-pass model is:\n\n\n    streaming performance = layout + sharding + DataLoader + cache + storage + topology\n\n\nThis is why the same dataset can feel fine in one setup and painful in another:\n\nSetup | Likely behavior\n---|---\nsingle process, local cache, simple iteration | usually easy to reason about\none GPU, remote Parquet, moderate workers | often workable, but DataLoader tuning matters\nmulti-GPU single node | shard splitting and batch consistency start to matter\nmulti-node DDP | rank/node-aware sharding and cache locality become first-class\nobject storage + many small files | request count and metadata/listing overhead can dominate\nNFS/shared cache | cache contention and filesystem behavior can dominate\nheavy transform/decode in workers | CPU/RSS/DataLoader behavior can dominate\n\n* * *\n\n## 2. Important context: some older request-amplification issues have changed\n\nOne reason this topic is delicate is that Hugging Face has already improved part of the streaming stack.\n\nThe official post Streaming datasets: 100x More Efficient describes improvements such as:\n\n  * 100x fewer requests\n  * 10x faster data resolution\n  * 2x sample/sec\n  * 0 worker crashes at 256 concurrent workers\n  * better data-files caching across DataLoader workers\n  * Parquet prefetching\n  * more configurable buffering\n\n\n\nSo I would avoid a blanket claim like:\n\n> “HF streaming causes request storms.”\n\nA safer statement is:\n\n> Some request-amplification paths have been improved, but large-scale training can still suffer when dataset layout, sharding, cache placement, and DataLoader concurrency are not aligned.\n\nThat distinction matters. The remaining pain is probably less about “plain download is slow” and more about **distributed coordination**.\n\nRelevant links:\n\n  * Streaming datasets: 100x More Efficient\n  * Datasets streaming docs\n  * IterableDataset / main classes docs\n\n\n\n* * *\n\n## 3. Dataset layout matters more than it first appears\n\nA streaming dataset is not just “a dataset stored remotely”.\n\nThe physical layout can determine:\n\n  * how many remote requests are needed\n  * how many shards each rank must open\n  * whether shards can be assigned cleanly to nodes\n  * whether cache locality is possible\n  * how much data is skipped\n  * whether prefetch helps or hurts\n  * whether one worker can keep a GPU fed\n\n\n\nImportant layout variables:\n\nVariable | Why it matters\n---|---\nNumber of shards | Too few shards can limit parallelism; too many tiny shards can increase request and metadata overhead\nShard size | Huge shards can reduce scheduling flexibility; tiny shards can increase overhead\nParquet row group size | Affects range-read granularity and filtering/projection behavior\nCompression | Can move the bottleneck from network to CPU\nFile count | Many small files can be painful on object storage\nBalance across shards | Uneven shards can cause per-rank imbalance\nDivisibility by `world_size` | Important for avoiding wasteful skip behavior\nFormat | Parquet, Arrow, WebDataset tar shards, MDS, raw JSONL, image/audio files all behave differently\n\nA practical rule:\n\n> If you control the dataset layout, design shards for the training topology, not only for storage convenience.\n\nFor large training jobs, I would prefer:\n\n  * enough shards to distribute across ranks/workers\n  * shard counts divisible by common `world_size` values\n  * reasonably balanced shard sizes\n  * Parquet/WebDataset/MDS-style layouts over many tiny raw files\n  * predictable row-group or shard boundaries\n  * avoiding a single huge file unless the reader can split it well\n\n\n\nRelevant links:\n\n  * HF Datasets streaming docs\n  * HF Hub dataset streaming guide\n  * WebDataset multinode docs\n  * MosaicML StreamingDataset\n  * StreamingDataset API docs\n\n\n\n* * *\n\n## 4. The `num_shards` vs `world_size` issue is probably central\n\nOne of the most important practical questions is:\n\n> Does the dataset shard cleanly across the distributed training topology?\n\nThe HF docs for iterable datasets describe a distinction that is easy to miss:\n\n  * if the dataset has a number of shards that divides evenly across `world_size`, shards can be assigned more cleanly\n  * otherwise, each node may keep only a fraction of examples and skip the others\n\n\n\nThat means a bad shard topology can turn into:\n\n  * unnecessary reads\n  * wasted remote traffic\n  * low cache hit rate\n  * poor locality\n  * uneven work across ranks\n  * slow epoch boundaries\n\n\n\nA practical mental model:\n\n\n    good case:\n      rank/node reads mostly the shards it will actually train on\n\n    bad case:\n      rank/node touches lots of data, then discards/skips much of it\n\n\nFor small datasets, this may be acceptable.\n\nFor 10 TB, 100 TB, or multi-node training, this can become the whole problem.\n\nRelevant links:\n\n  * HF Datasets main classes / IterableDataset docs\n  * Issue: IterableDataset distributed/DataLoader sharding performance traps\n  * Issue: streaming datasets doesn’t work properly with multi-node\n  * Issue: Problem in training iterable dataset\n  * Forum: Loading webdatasets across multiple nodes\n\n\n\n* * *\n\n## 5. `split_dataset_by_node()` helps, but it is not a universal fix\n\nFor multi-node training, `split_dataset_by_node()` is an important tool.\n\nThe forum answer in Loading webdatasets across multiple nodes is especially useful because it states the practical distinction clearly:\n\n  * by default, each node may stream the full dataset and skip samples to avoid duplicates\n  * `split_dataset_by_node()` can assign shards to each node so each node streams only what it uses\n\n\n\nThat is a big difference.\n\nBut it does not mean every distributed streaming problem disappears.\n\nThings that still need care:\n\nIssue | Why it matters\n---|---\nShard count not divisible by world size | Can still cause unevenness or skipping\nShuffle behavior | Shuffling can fight cache locality\nEpoch boundaries | Some reports mention hangs or crashes after an epoch\nBatch size consistency | DDP/gradient sync can dislike uneven batches\nWorker-level splitting | Node-level splitting is not always enough\nResumption | Mid-epoch restart can be expensive if ordering is not deterministic\nCache reuse | A split that changes every epoch can reduce cache locality\n\nUseful links:\n\n  * Forum: How to use split_dataset_by_node and shuffle on IterableDataset\n  * Forum: Keeping IterableDataset node-wise split fixed during DDP\n  * Forum: Loading webdatasets across multiple nodes\n  * Issue: streaming datasets doesn’t work properly with multi-node\n\n\n\n* * *\n\n## 6. DataLoader settings can make a streaming issue look like a memory leak\n\nIf RSS grows during streaming, I would not immediately conclude:\n\n> “HF Datasets has a memory leak.”\n\nIt might be true in some cases, but it is not the first conclusion I would jump to.\n\nA streaming pipeline often involves:\n\n\n    remote read\n      -> decompression / decoding\n      -> Python objects\n      -> Arrow / Parquet reader\n      -> map/filter/transform\n      -> DataLoader worker process\n      -> prefetch queue\n      -> collate_fn\n      -> pinned memory\n      -> GPU transfer\n\n\nAny of these can hold references or buffer data.\n\nImportant DataLoader knobs:\n\nKnob | Why it matters\n---|---\n`num_workers` | More worker processes can increase parallelism and memory use\n`prefetch_factor` | Each worker can prepare multiple batches ahead\n`pin_memory` | Puts fetched tensors in pinned host memory for faster GPU transfer\n`persistent_workers` | Keeps worker processes alive across epochs\ncustom `collate_fn` | Can accidentally retain objects or create large temporary structures\ndecode/transform in workers | Can move CPU and memory pressure into worker processes\n\nRelevant links:\n\n  * PyTorch DataLoader docs\n  * TorchData Stateful DataLoader docs\n  * PyTorch issue: pin_memory / prefetch / persistent_workers memory behavior\n  * Forum: A streaming dataset’s memory footprint continually grows\n  * Issue: Memory leak when streaming\n\n\n\nA good debugging baseline:\n\n\n    DataLoader(\n        dataset,\n        batch_size=<batch_size>,\n        num_workers=0,\n        pin_memory=False,\n        persistent_workers=False,\n    )\n\n\nThen add concurrency back one variable at a time:\n\n\n    # Step 1: no multiprocessing\n    num_workers = 0\n\n    # Step 2: one worker\n    num_workers = 1\n\n    # Step 3: several workers, low prefetch\n    num_workers = 4\n    prefetch_factor = 1\n\n    # Step 4: only then test pinning/persistence\n    pin_memory = True\n    persistent_workers = True\n\n\nIf memory only grows after `num_workers > 0`, it may be DataLoader multiprocessing, prefetch, transform, or worker lifetime.\n\nIf memory grows even with `num_workers=0`, the issue is more likely in dataset iteration, decoding, transformation, Arrow/Parquet, Python object retention, or user code.\n\n* * *\n\n## 7. Cache placement is part of the runtime contract\n\nFor large jobs, cache is not just a convenience.\n\nIt is part of the runtime design.\n\nRelevant Hugging Face cache variables:\n\nVariable | Role\n---|---\n`HF_HOME` | Base directory for Hugging Face local state\n`HF_HUB_CACHE` | Hub repository/file cache\n`HF_XET_CACHE` | Xet chunk cache\n`HF_DATASETS_CACHE` | Datasets cache files, not the same as Hub file cache\n\nLinks:\n\n  * huggingface_hub environment variables\n  * huggingface_hub cache management\n  * Datasets cache management\n  * Issue: HF_DATASETS_CACHE ignored?\n\n\n\nA common cluster problem is that `$HOME` points to NFS or another shared filesystem.\n\nThat can make defaults bad:\n\n\n    # maybe bad on clusters if $HOME is on NFS\n    ~/.cache/huggingface\n\n\nA safer pattern for large training jobs is usually:\n\n\n    export HF_HOME=/local_nvme/hf\n    export HF_HUB_CACHE=/local_nvme/hf/hub\n    export HF_XET_CACHE=/local_nvme/hf/xet\n    export HF_DATASETS_CACHE=/local_nvme/hf/datasets\n\n\nThe exact paths depend on the cluster, but the principle is:\n\n> Put high-churn caches on node-local SSD/NVMe when possible.\n\nHF’s upload guide explicitly recommends putting `HF_XET_CACHE` on a local disk such as NVMe/SSD when uploading from distributed filesystems like NFS, EBS, Lustre, or FSx.\n\nRelevant links:\n\n  * HF upload guide: local Xet cache on NVMe/SSD\n  * Environment variables: HF_XET_CACHE\n  * Hub local cache docs\n\n\n\n* * *\n\n## 8. Hub/Xet is a separate layer from Datasets iteration\n\nIt is also useful to separate:\n\n\n    Datasets iteration problem\n    vs\n    Hub transfer/cache problem\n\n\nThe Hub has moved toward Xet-backed storage and `hf_xet` for file transfer and chunk caching.\n\nThat means a dataset load can involve more than “plain HTTP file download”.\n\nBut not every Datasets streaming problem is an Xet problem.\n\nI would separate these cases:\n\nSymptom | More likely layer\n---|---\nCLI download stalls | Hub/Xet/network/cache\n`load_dataset(..., streaming=True)` stalls but CLI download works | Datasets iteration / reader / transform / DataLoader\nXet disabled changes behavior | Hub/Xet/cache path\nonly S3/GCS/FUSE path is slow | object storage client / filesystem abstraction\nonly multi-node training is slow | sharding/topology/cache locality\nonly `num_workers > 0` leaks memory | DataLoader / multiprocessing / prefetch\n\nPotential diagnostic toggles for Hub/Xet transfer paths:\n\n\n    # Current default path\n    hf download <repo_id> --repo-type dataset\n\n    # Xet high-performance mode\n    HF_XET_HIGH_PERFORMANCE=1 hf download <repo_id> --repo-type dataset\n\n    # Xet-disabled diagnostic path\n    HF_HUB_DISABLE_XET=1 hf download <repo_id> --repo-type dataset\n\n\nI would treat `HF_HUB_DISABLE_XET=1` as a diagnostic switch, not a universal permanent fix.\n\nRelevant links:\n\n  * Using Xet storage\n  * huggingface_hub environment variables\n  * HF upload guide\n\n\n\n* * *\n\n## 9. S3/fsspec/FUSE/object storage can be a hidden bottleneck\n\nIf the dataset is not on the Hub, there is another layer:\n\n\n    S3 / GCS / Azure Blob / custom object storage\n      -> fsspec / s3fs / gcsfs / adlfs\n      -> range reads / block cache / retry policy\n      -> Python reader / Parquet reader / DataLoader worker\n\n\nA path that looks like a normal filesystem may not behave like a normal filesystem.\n\nFUSE/object-store mounts can hide:\n\n  * range-request amplification\n  * poor random-read behavior\n  * slow stat/list operations\n  * file handle pressure\n  * local block cache growth\n  * retry storms\n  * credential refresh issues\n  * differences between sequential and random reads\n\n\n\nFor S3/fsspec specifically, block size and cache type can strongly affect both speed and memory.\n\nRelevant links:\n\n  * s3fs issue: large file read, block size, cache type\n  * fsspec documentation\n  * s3fs documentation\n\n\n\nA practical rule:\n\n> If changing only the storage backend changes the symptom, do not debug only `datasets`.\n\n* * *\n\n## 10. Direct remote streaming is not always the safest production path\n\nFor exploration, `streaming=True` is excellent.\n\nFor production-scale training, I would be more cautious.\n\nIf the job is expensive, multi-node, or long-running, it may be safer to stage data explicitly:\n\n\n    remote dataset\n      -> node-local NVMe / local shard cache\n      -> train from local shards\n\n\nor:\n\n\n    remote raw data\n      -> preprocessing job\n      -> materialized Parquet/WebDataset/MDS shards\n      -> training job\n\n\nThis is not as elegant as direct streaming, but it is often easier to debug and reproduce.\n\nWorkaround options:\n\nIf you see… | Consider…\n---|---\nlow GPU utilization | increase DataLoader concurrency carefully; profile CPU/decode/storage\nhigh RSS | reduce workers/prefetch/pinning; remove transforms; test `num_workers=0`\nrepeated remote reads | local cache or pre-stage to NVMe\nskip-heavy distributed reads | redesign shard count or use explicit node/rank splitting\nNFS stalls | move cache to node-local SSD/NVMe\nobject-store request pressure | larger shards, better row groups, fewer tiny files\nheavy preprocessing | materialize processed data\nmid-epoch failures | local staging or deterministic/resumable streaming format\n\n* * *\n\n## 11. Role-based responsibilities\n\nI would not put all responsibility on one party.\n\nDifferent actors can improve different parts of the stack.\n\nRole | Practical action\n---|---\nDataset publisher | publish topology-friendly shard layouts; document shard count, shard size, row group size\nTraining user | choose `split_dataset_by_node()`, tune DataLoader, control cache paths\nInfra/platform owner | provide node-local SSD/NVMe cache space; avoid hot caches on NFS\nHF Datasets maintainers | improve topology-aware sharding, diagnostics, docs, `world_size` guidance\nHub/Xet maintainers | improve transfer observability, cache behavior, failure diagnostics\nPyTorch/DataLoader maintainers | clarify/optimize worker, prefetch, pinned-memory behavior\nStorage-client maintainers | improve object-store read/cache semantics and visibility\n\nThe important point is:\n\n> A “streaming problem” may require a dataset-layout fix, not a streaming-library fix.\n\n* * *\n\n## 12. What diagnostics would make this much easier?\n\nA lot of confusion comes from not knowing which layer is doing extra work.\n\nUseful diagnostics would include:\n\nMetric | Why it helps\n---|---\nper-rank samples consumed | detect imbalance\nper-rank shards opened | detect unnecessary shard touching\nskipped samples / skipped bytes | detect wasteful sharding\ncache hit/miss rate | detect poor locality\nremote requests per worker | detect request amplification\nbytes downloaded per rank | detect duplication\nDataLoader queue depth | detect input bottleneck\nworker RSS | detect worker memory growth\npinned memory use | detect host memory pressure\ntime per shard transition | detect metadata or open overhead\nretry count / HTTP status | detect backend instability\nrow groups read | detect Parquet granularity issues\n\nEven a small debug mode that reports some of these would help users avoid guessing.\n\nA useful log format might look like:\n\n\n    rank=3 node=0 worker=5\n    samples=120000\n    shards_opened=18\n    shards_assigned=18\n    remote_bytes=42GB\n    skipped_samples=0\n    cache_hits=1532\n    cache_misses=12\n    rss_peak=7.8GB\n    avg_batch_wait_ms=41\n\n\nThis kind of visibility would make it much easier to tell whether the problem is layout, sharding, cache, DataLoader, or storage.\n\n* * *\n\n## 13. Practical checklist\n\nIf I had to debug a large training job today, I would use this order.\n\n### Step 1: Record versions and environment\n\n\n    python - <<'PY'\n    import platform, os\n    print(\"python\", platform.python_version())\n    print(\"platform\", platform.platform())\n\n    for name in [\n        \"HF_HOME\",\n        \"HF_HUB_CACHE\",\n        \"HF_XET_CACHE\",\n        \"HF_DATASETS_CACHE\",\n        \"HF_HUB_DISABLE_XET\",\n        \"HF_XET_HIGH_PERFORMANCE\",\n    ]:\n        print(name, os.environ.get(name))\n\n    try:\n        import datasets\n        print(\"datasets\", datasets.__version__)\n    except Exception as e:\n        print(\"datasets unavailable\", repr(e))\n\n    try:\n        import huggingface_hub\n        print(\"huggingface_hub\", huggingface_hub.__version__)\n    except Exception as e:\n        print(\"huggingface_hub unavailable\", repr(e))\n\n    try:\n        import torch\n        print(\"torch\", torch.__version__)\n    except Exception as e:\n        print(\"torch unavailable\", repr(e))\n\n    try:\n        import pyarrow\n        print(\"pyarrow\", pyarrow.__version__)\n    except Exception as e:\n        print(\"pyarrow unavailable\", repr(e))\n\n    try:\n        import hf_xet\n        print(\"hf_xet\", getattr(hf_xet, \"__version__\", \"unknown\"))\n    except Exception as e:\n        print(\"hf_xet unavailable\", repr(e))\n    PY\n\n\n### Step 2: Inspect layout\n\nRecord:\n\n\n    dataset format:\n    total size:\n    file count:\n    shard count:\n    average shard size:\n    largest shard:\n    smallest shard:\n    Parquet row group size:\n    compression:\n    remote backend:\n\n\n### Step 3: Check topology\n\nRecord:\n\n\n    nodes:\n    ranks per node:\n    world_size:\n    DataLoader workers per rank:\n    batch size per rank:\n    prefetch_factor:\n    pin_memory:\n    persistent_workers:\n\n\n### Step 4: Test minimal iteration\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\"<dataset>\", split=\"train\", streaming=True)\n\n    for i, x in zip(range(10_000), ds):\n        if i % 1000 == 0:\n            print(i)\n\n\nNo DataLoader, no transforms, no Trainer.\n\n### Step 5: Add DataLoader with no multiprocessing\n\n\n    from torch.utils.data import DataLoader\n\n    loader = DataLoader(\n        ds,\n        batch_size=<batch_size>,\n        num_workers=0,\n        pin_memory=False,\n    )\n\n    for i, batch in enumerate(loader):\n        if i % 100 == 0:\n            print(i)\n\n\n### Step 6: Add workers gradually\n\n\n    loader = DataLoader(\n        ds,\n        batch_size=<batch_size>,\n        num_workers=1,\n        pin_memory=False,\n        persistent_workers=False,\n    )\n\n\nThen try:\n\n\n    loader = DataLoader(\n        ds,\n        batch_size=<batch_size>,\n        num_workers=4,\n        prefetch_factor=1,\n        pin_memory=False,\n        persistent_workers=False,\n    )\n\n\nOnly after that, test:\n\n\n    pin_memory=True\n    persistent_workers=True\n\n\n### Step 7: Test distributed split explicitly\n\n\n    from datasets import load_dataset\n    from datasets.distributed import split_dataset_by_node\n\n    ds = load_dataset(\"<dataset>\", split=\"train\", streaming=True)\n    ds = split_dataset_by_node(ds, rank=<rank>, world_size=<world_size>)\n\n\nThen verify that each rank sees the expected data volume and does not duplicate work.\n\n### Step 8: Move cache to local disk\n\n\n    export HF_HOME=/local_nvme/hf\n    export HF_HUB_CACHE=/local_nvme/hf/hub\n    export HF_XET_CACHE=/local_nvme/hf/xet\n    export HF_DATASETS_CACHE=/local_nvme/hf/datasets\n\n\nThen rerun the smallest reproduction.\n\n### Step 9: Decide whether to stream, stage, or materialize\n\nSituation | Safer choice\n---|---\nquick exploration | direct streaming\nsingle-node training | streaming may be fine if layout is good\nmulti-node production training | consider pre-staging or topology-aware shard layout\nheavy preprocessing | materialize processed data\nunstable object-store path | local staging/cache\nstrict restart/resume needs | deterministic/resumable streaming format\n\n* * *\n\n## 14. Current practical conclusion\n\nI would summarize it this way:\n\n`streaming=True` is a useful interface, but large-scale training needs more than lazy iteration.\n\nIt needs a data access plan that is aware of:\n\n  * dataset layout\n  * shard count\n  * world size\n  * node/rank/worker assignment\n  * cache locality\n  * DataLoader memory behavior\n  * storage backend semantics\n  * restart/resume behavior\n\n\n\nSo the safest practical recommendation is not:\n\n> “Just turn on streaming.”\n\nIt is closer to:\n\n> Design the dataset layout and cache path for the training topology, then use streaming as one component of that data plane.\n\nFor many production-scale jobs, the best workaround may be:\n\n\n    1. preprocess/materialize if transforms are heavy\n    2. publish balanced shards\n    3. choose shard counts compatible with expected world sizes\n    4. split explicitly by node/rank\n    5. keep cache on local NVMe/SSD\n    6. tune DataLoader concurrency conservatively\n    7. monitor skipped samples, cache hits, remote bytes, worker RSS\n\n\nThat does not mean direct streaming is bad.\n\nIt means direct remote streaming is only one point in the design space.\n\nFor small jobs, it is simple and convenient.\n\nFor large distributed training, topology-aware sharding and cache locality become first-class concerns.\n\n* * *\n\n## 15. Link collection\n\n### Hugging Face official docs / posts\n\n  * Streaming datasets: 100x More Efficient\n  * Datasets streaming docs\n  * Dataset vs IterableDataset\n  * Datasets main classes / IterableDataset reference\n  * Datasets cache management\n  * Hub local cache\n  * huggingface_hub environment variables\n  * huggingface_hub cache management\n  * Upload files to the Hub / Xet cache guidance\n  * Using Xet storage\n\n\n\n### Hugging Face forum / issues\n\n  * Forum: Loading webdatasets across multiple nodes\n  * Forum: How to use split_dataset_by_node and shuffle on IterableDataset\n  * Forum: Keeping IterableDataset node-wise split fixed during DDP\n  * Forum: A streaming dataset’s memory footprint continually grows\n  * Forum: Speeding up Streaming of Large Datasets / FineWeb\n  * Issue: IterableDataset distributed/DataLoader sharding performance traps\n  * Issue: streaming datasets doesn’t work properly with multi-node\n  * Issue: Problem in training iterable dataset\n  * Issue: Memory leak when streaming\n  * Issue: More easily support streaming local files\n  * Issue: Make Dataset streaming queries retryable\n  * Issue: HF_DATASETS_CACHE ignored?\n\n\n\n### PyTorch / DataLoader\n\n  * PyTorch DataLoader docs\n  * TorchData Stateful DataLoader docs\n  * PyTorch issue: pin_memory / prefetch / persistent_workers memory behavior\n\n\n\n### Alternative / adjacent streaming systems\n\n  * WebDataset multinode docs\n  * WebDataset GitHub\n  * WebDataset PyPI notes\n  * MosaicML StreamingDataset GitHub\n  * MosaicML StreamingDataset API docs\n  * Databricks/MosaicML blog: StreamingDataset\n\n\n\n### Object storage / filesystem layer\n\n  * fsspec docs\n  * s3fs docs\n  * s3fs issue: large file read, block size, cache type\n\n",
  "title": "Datasets streaming bottlenecks: storage client memory/GC pressure vs sharding/cache/request amplification?"
}