External Publication

Datasets streaming bottlenecks: storage client memory/GC pressure vs sharding/cache/request amplification?

Hugging Face Forums [Unofficial] June 4, 2026

Hmm… This isn’t based on personal experience, but since it doesn’t seem like a simple problem, it looks like it would be best to sort things out first. This looks like a pretty layered problem:

Direct answer

Yes — I think this is a real pain point, but I would not frame it as one single bottleneck or as one generic “storage client problem”.

For large-scale training, the bottleneck can absolutely come from the data access layer rather than raw network bandwidth. But in practice, “data access layer” usually means several interacting pieces:

dataset layout
  -> sharding / rank-node-worker assignment
  -> DataLoader prefetch / multiprocessing / pinned memory
  -> Hub/Xet or S3/GCS/Azure/fsspec/FUSE transport
  -> local cache placement: NVMe/SSD vs NFS/shared filesystem
  -> training topology: world_size, nodes, workers, epochs

So my short answer would be:

Question	Practical answer
Is this a real pain point?	Yes, especially for multi-worker or multi-node training, large Parquet/WebDataset-style corpora, object storage, and shared filesystems.
Is it just raw bandwidth?	Usually no. Request patterns, shard layout, cache locality, DataLoader behavior, and memory pressure can dominate.
Is request amplification still the main issue?	Not always. HF has already improved the Hub streaming path significantly, including data-file resolution, request reduction, Parquet prefetching, and buffering. See Streaming datasets: 100x More Efficient.
Is memory/RSS growth real?	There are reports of it, but I would not immediately blame only `datasets`. PyTorch DataLoader workers, prefetching, pinned memory, transforms, decoding, Arrow/Parquet readers, and fsspec caches can all contribute.
Is sharding important?	Very important. Bad alignment between shard count and `world_size` can cause skipped samples, duplicated reads, poor locality, or uneven work.
Would a Rust streaming/cache layer help?	Maybe, but only if it solves topology-aware shard scheduling, bounded prefetch/memory, local cache locality, and observability. “Rust” alone is not the product; the useful thing would be a clearer data-plane contract.
What should users do today?	Use topology-friendly shards, split by node/rank explicitly, keep cache on local NVMe/SSD, tune DataLoader conservatively, and materialize heavy preprocessing when needed.

So I would describe the minimum useful integration not as “a faster downloader”, but as something closer to:

a topology-aware streaming data plane for training.

That would include:

shard planning aware of world_size, ranks, nodes, and DataLoader workers
node-local cache locality across epochs
bounded-memory prefetching
clear interaction with PyTorch DataLoader
support for Hub/Xet and object-storage-like backends
diagnostics for skipped samples, duplicate reads, cache hits, remote bytes, request count, worker RSS, and per-rank throughput

Without those pieces, a lower-level cache or storage client might improve one workload while leaving the main distributed-training failure mode unresolved.

1. Why this is not just “streaming is slow”

For small or exploratory use, streaming=True is often exactly the right tool: it lets you iterate over large datasets without downloading everything first. The official docs describe dataset streaming as a way to work with a dataset as you iterate over it, instead of fully downloading it first:

Datasets streaming docs
Dataset vs IterableDataset

But for large-scale training, especially with many DataLoader workers, multiple ranks, multiple nodes, remote object storage, or shared filesystems, the bottleneck is often not simply “network bandwidth”.

A useful first-pass model is:

streaming performance = layout + sharding + DataLoader + cache + storage + topology

This is why the same dataset can feel fine in one setup and painful in another:

Setup	Likely behavior
single process, local cache, simple iteration	usually easy to reason about
one GPU, remote Parquet, moderate workers	often workable, but DataLoader tuning matters
multi-GPU single node	shard splitting and batch consistency start to matter
multi-node DDP	rank/node-aware sharding and cache locality become first-class
object storage + many small files	request count and metadata/listing overhead can dominate
NFS/shared cache	cache contention and filesystem behavior can dominate
heavy transform/decode in workers	CPU/RSS/DataLoader behavior can dominate

2. Important context: some older request-amplification issues have changed

One reason this topic is delicate is that Hugging Face has already improved part of the streaming stack.

The official post Streaming datasets: 100x More Efficient describes improvements such as:

100x fewer requests
10x faster data resolution
2x sample/sec
0 worker crashes at 256 concurrent workers
better data-files caching across DataLoader workers
Parquet prefetching
more configurable buffering

So I would avoid a blanket claim like:

“HF streaming causes request storms.”

A safer statement is:

Some request-amplification paths have been improved, but large-scale training can still suffer when dataset layout, sharding, cache placement, and DataLoader concurrency are not aligned.

That distinction matters. The remaining pain is probably less about “plain download is slow” and more about distributed coordination.

Relevant links:

Streaming datasets: 100x More Efficient
Datasets streaming docs
IterableDataset / main classes docs

3. Dataset layout matters more than it first appears

A streaming dataset is not just “a dataset stored remotely”.

The physical layout can determine:

how many remote requests are needed
how many shards each rank must open
whether shards can be assigned cleanly to nodes
whether cache locality is possible
how much data is skipped
whether prefetch helps or hurts
whether one worker can keep a GPU fed

Important layout variables:

Variable	Why it matters
Number of shards	Too few shards can limit parallelism; too many tiny shards can increase request and metadata overhead
Shard size	Huge shards can reduce scheduling flexibility; tiny shards can increase overhead
Parquet row group size	Affects range-read granularity and filtering/projection behavior
Compression	Can move the bottleneck from network to CPU
File count	Many small files can be painful on object storage
Balance across shards	Uneven shards can cause per-rank imbalance
Divisibility by `world_size`	Important for avoiding wasteful skip behavior
Format	Parquet, Arrow, WebDataset tar shards, MDS, raw JSONL, image/audio files all behave differently

A practical rule:

If you control the dataset layout, design shards for the training topology, not only for storage convenience.

For large training jobs, I would prefer:

enough shards to distribute across ranks/workers
shard counts divisible by common world_size values
reasonably balanced shard sizes
Parquet/WebDataset/MDS-style layouts over many tiny raw files
predictable row-group or shard boundaries
avoiding a single huge file unless the reader can split it well

Relevant links:

HF Datasets streaming docs
HF Hub dataset streaming guide
WebDataset multinode docs
MosaicML StreamingDataset
StreamingDataset API docs

4. The `num_shards` vs `world_size` issue is probably central

One of the most important practical questions is:

Does the dataset shard cleanly across the distributed training topology?

The HF docs for iterable datasets describe a distinction that is easy to miss:

if the dataset has a number of shards that divides evenly across world_size, shards can be assigned more cleanly
otherwise, each node may keep only a fraction of examples and skip the others

That means a bad shard topology can turn into:

unnecessary reads
wasted remote traffic
low cache hit rate
poor locality
uneven work across ranks
slow epoch boundaries

A practical mental model:

good case:
  rank/node reads mostly the shards it will actually train on

bad case:
  rank/node touches lots of data, then discards/skips much of it

For small datasets, this may be acceptable.

For 10 TB, 100 TB, or multi-node training, this can become the whole problem.

Relevant links:

HF Datasets main classes / IterableDataset docs
Issue: IterableDataset distributed/DataLoader sharding performance traps
Issue: streaming datasets doesn’t work properly with multi-node
Issue: Problem in training iterable dataset
Forum: Loading webdatasets across multiple nodes

5. `split_dataset_by_node()` helps, but it is not a universal fix

For multi-node training, split_dataset_by_node() is an important tool.

The forum answer in Loading webdatasets across multiple nodes is especially useful because it states the practical distinction clearly:

by default, each node may stream the full dataset and skip samples to avoid duplicates
split_dataset_by_node() can assign shards to each node so each node streams only what it uses

That is a big difference.

But it does not mean every distributed streaming problem disappears.

Things that still need care:

Issue	Why it matters
Shard count not divisible by world size	Can still cause unevenness or skipping
Shuffle behavior	Shuffling can fight cache locality
Epoch boundaries	Some reports mention hangs or crashes after an epoch
Batch size consistency	DDP/gradient sync can dislike uneven batches
Worker-level splitting	Node-level splitting is not always enough
Resumption	Mid-epoch restart can be expensive if ordering is not deterministic
Cache reuse	A split that changes every epoch can reduce cache locality

Useful links:

Forum: How to use split_dataset_by_node and shuffle on IterableDataset
Forum: Keeping IterableDataset node-wise split fixed during DDP
Forum: Loading webdatasets across multiple nodes
Issue: streaming datasets doesn’t work properly with multi-node

6. DataLoader settings can make a streaming issue look like a memory leak

If RSS grows during streaming, I would not immediately conclude:

“HF Datasets has a memory leak.”

It might be true in some cases, but it is not the first conclusion I would jump to.

A streaming pipeline often involves:

remote read
  -> decompression / decoding
  -> Python objects
  -> Arrow / Parquet reader
  -> map/filter/transform
  -> DataLoader worker process
  -> prefetch queue
  -> collate_fn
  -> pinned memory
  -> GPU transfer

Any of these can hold references or buffer data.

Important DataLoader knobs:

Knob	Why it matters
`num_workers`	More worker processes can increase parallelism and memory use
`prefetch_factor`	Each worker can prepare multiple batches ahead
`pin_memory`	Puts fetched tensors in pinned host memory for faster GPU transfer
`persistent_workers`	Keeps worker processes alive across epochs
custom `collate_fn`	Can accidentally retain objects or create large temporary structures
decode/transform in workers	Can move CPU and memory pressure into worker processes

Relevant links:

PyTorch DataLoader docs
TorchData Stateful DataLoader docs
PyTorch issue: pin_memory / prefetch / persistent_workers memory behavior
Forum: A streaming dataset’s memory footprint continually grows
Issue: Memory leak when streaming

A good debugging baseline:

DataLoader(
    dataset,
    batch_size=<batch_size>,
    num_workers=0,
    pin_memory=False,
    persistent_workers=False,
)

Then add concurrency back one variable at a time:

# Step 1: no multiprocessing
num_workers = 0

# Step 2: one worker
num_workers = 1

# Step 3: several workers, low prefetch
num_workers = 4
prefetch_factor = 1

# Step 4: only then test pinning/persistence
pin_memory = True
persistent_workers = True

If memory only grows after num_workers > 0, it may be DataLoader multiprocessing, prefetch, transform, or worker lifetime.

If memory grows even with num_workers=0, the issue is more likely in dataset iteration, decoding, transformation, Arrow/Parquet, Python object retention, or user code.

7. Cache placement is part of the runtime contract

For large jobs, cache is not just a convenience.

It is part of the runtime design.

Relevant Hugging Face cache variables:

Variable	Role
`HF_HOME`	Base directory for Hugging Face local state
`HF_HUB_CACHE`	Hub repository/file cache
`HF_XET_CACHE`	Xet chunk cache
`HF_DATASETS_CACHE`	Datasets cache files, not the same as Hub file cache

Links:

huggingface_hub environment variables
huggingface_hub cache management
Datasets cache management
Issue: HF_DATASETS_CACHE ignored?

A common cluster problem is that $HOME points to NFS or another shared filesystem.

That can make defaults bad:

# maybe bad on clusters if $HOME is on NFS
~/.cache/huggingface

A safer pattern for large training jobs is usually:

export HF_HOME=/local_nvme/hf
export HF_HUB_CACHE=/local_nvme/hf/hub
export HF_XET_CACHE=/local_nvme/hf/xet
export HF_DATASETS_CACHE=/local_nvme/hf/datasets

The exact paths depend on the cluster, but the principle is:

Put high-churn caches on node-local SSD/NVMe when possible.

HF’s upload guide explicitly recommends putting HF_XET_CACHE on a local disk such as NVMe/SSD when uploading from distributed filesystems like NFS, EBS, Lustre, or FSx.

Relevant links:

HF upload guide: local Xet cache on NVMe/SSD
Environment variables: HF_XET_CACHE
Hub local cache docs

8. Hub/Xet is a separate layer from Datasets iteration

It is also useful to separate:

Datasets iteration problem
vs
Hub transfer/cache problem

The Hub has moved toward Xet-backed storage and hf_xet for file transfer and chunk caching.

That means a dataset load can involve more than “plain HTTP file download”.

But not every Datasets streaming problem is an Xet problem.

I would separate these cases:

Symptom	More likely layer
CLI download stalls	Hub/Xet/network/cache
`load_dataset(..., streaming=True)` stalls but CLI download works	Datasets iteration / reader / transform / DataLoader
Xet disabled changes behavior	Hub/Xet/cache path
only S3/GCS/FUSE path is slow	object storage client / filesystem abstraction
only multi-node training is slow	sharding/topology/cache locality
only `num_workers > 0` leaks memory	DataLoader / multiprocessing / prefetch

Potential diagnostic toggles for Hub/Xet transfer paths:

# Current default path
hf download <repo_id> --repo-type dataset

# Xet high-performance mode
HF_XET_HIGH_PERFORMANCE=1 hf download <repo_id> --repo-type dataset

# Xet-disabled diagnostic path
HF_HUB_DISABLE_XET=1 hf download <repo_id> --repo-type dataset

I would treat HF_HUB_DISABLE_XET=1 as a diagnostic switch, not a universal permanent fix.

Relevant links:

Using Xet storage
huggingface_hub environment variables
HF upload guide

9. S3/fsspec/FUSE/object storage can be a hidden bottleneck

If the dataset is not on the Hub, there is another layer:

S3 / GCS / Azure Blob / custom object storage
  -> fsspec / s3fs / gcsfs / adlfs
  -> range reads / block cache / retry policy
  -> Python reader / Parquet reader / DataLoader worker

A path that looks like a normal filesystem may not behave like a normal filesystem.

FUSE/object-store mounts can hide:

range-request amplification
poor random-read behavior
slow stat/list operations
file handle pressure
local block cache growth
retry storms
credential refresh issues
differences between sequential and random reads

For S3/fsspec specifically, block size and cache type can strongly affect both speed and memory.

Relevant links:

s3fs issue: large file read, block size, cache type
fsspec documentation
s3fs documentation

A practical rule:

If changing only the storage backend changes the symptom, do not debug only datasets.

10. Direct remote streaming is not always the safest production path

For exploration, streaming=True is excellent.

For production-scale training, I would be more cautious.

If the job is expensive, multi-node, or long-running, it may be safer to stage data explicitly:

remote dataset
  -> node-local NVMe / local shard cache
  -> train from local shards

or:

remote raw data
  -> preprocessing job
  -> materialized Parquet/WebDataset/MDS shards
  -> training job

This is not as elegant as direct streaming, but it is often easier to debug and reproduce.

Workaround options:

If you see…	Consider…
low GPU utilization	increase DataLoader concurrency carefully; profile CPU/decode/storage
high RSS	reduce workers/prefetch/pinning; remove transforms; test `num_workers=0`
repeated remote reads	local cache or pre-stage to NVMe
skip-heavy distributed reads	redesign shard count or use explicit node/rank splitting
NFS stalls	move cache to node-local SSD/NVMe
object-store request pressure	larger shards, better row groups, fewer tiny files
heavy preprocessing	materialize processed data
mid-epoch failures	local staging or deterministic/resumable streaming format

11. Role-based responsibilities

I would not put all responsibility on one party.

Different actors can improve different parts of the stack.

Role	Practical action
Dataset publisher	publish topology-friendly shard layouts; document shard count, shard size, row group size
Training user	choose `split_dataset_by_node()`, tune DataLoader, control cache paths
Infra/platform owner	provide node-local SSD/NVMe cache space; avoid hot caches on NFS
HF Datasets maintainers	improve topology-aware sharding, diagnostics, docs, `world_size` guidance
Hub/Xet maintainers	improve transfer observability, cache behavior, failure diagnostics
PyTorch/DataLoader maintainers	clarify/optimize worker, prefetch, pinned-memory behavior
Storage-client maintainers	improve object-store read/cache semantics and visibility

The important point is:

A “streaming problem” may require a dataset-layout fix, not a streaming-library fix.

12. What diagnostics would make this much easier?

A lot of confusion comes from not knowing which layer is doing extra work.

Useful diagnostics would include:

Metric	Why it helps
per-rank samples consumed	detect imbalance
per-rank shards opened	detect unnecessary shard touching
skipped samples / skipped bytes	detect wasteful sharding
cache hit/miss rate	detect poor locality
remote requests per worker	detect request amplification
bytes downloaded per rank	detect duplication
DataLoader queue depth	detect input bottleneck
worker RSS	detect worker memory growth
pinned memory use	detect host memory pressure
time per shard transition	detect metadata or open overhead
retry count / HTTP status	detect backend instability
row groups read	detect Parquet granularity issues

Even a small debug mode that reports some of these would help users avoid guessing.

A useful log format might look like:

rank=3 node=0 worker=5
samples=120000
shards_opened=18
shards_assigned=18
remote_bytes=42GB
skipped_samples=0
cache_hits=1532
cache_misses=12
rss_peak=7.8GB
avg_batch_wait_ms=41

This kind of visibility would make it much easier to tell whether the problem is layout, sharding, cache, DataLoader, or storage.

13. Practical checklist

If I had to debug a large training job today, I would use this order.

Step 1: Record versions and environment

python - <<'PY'
import platform, os
print("python", platform.python_version())
print("platform", platform.platform())

for name in [
    "HF_HOME",
    "HF_HUB_CACHE",
    "HF_XET_CACHE",
    "HF_DATASETS_CACHE",
    "HF_HUB_DISABLE_XET",
    "HF_XET_HIGH_PERFORMANCE",
]:
    print(name, os.environ.get(name))

try:
    import datasets
    print("datasets", datasets.__version__)
except Exception as e:
    print("datasets unavailable", repr(e))

try:
    import huggingface_hub
    print("huggingface_hub", huggingface_hub.__version__)
except Exception as e:
    print("huggingface_hub unavailable", repr(e))

try:
    import torch
    print("torch", torch.__version__)
except Exception as e:
    print("torch unavailable", repr(e))

try:
    import pyarrow
    print("pyarrow", pyarrow.__version__)
except Exception as e:
    print("pyarrow unavailable", repr(e))

try:
    import hf_xet
    print("hf_xet", getattr(hf_xet, "__version__", "unknown"))
except Exception as e:
    print("hf_xet unavailable", repr(e))
PY

Step 2: Inspect layout

Record:

dataset format:
total size:
file count:
shard count:
average shard size:
largest shard:
smallest shard:
Parquet row group size:
compression:
remote backend:

Step 3: Check topology

Record:

nodes:
ranks per node:
world_size:
DataLoader workers per rank:
batch size per rank:
prefetch_factor:
pin_memory:
persistent_workers:

Step 4: Test minimal iteration

from datasets import load_dataset

ds = load_dataset("<dataset>", split="train", streaming=True)

for i, x in zip(range(10_000), ds):
    if i % 1000 == 0:
        print(i)

No DataLoader, no transforms, no Trainer.

Step 5: Add DataLoader with no multiprocessing

from torch.utils.data import DataLoader

loader = DataLoader(
    ds,
    batch_size=<batch_size>,
    num_workers=0,
    pin_memory=False,
)

for i, batch in enumerate(loader):
    if i % 100 == 0:
        print(i)

Step 6: Add workers gradually

loader = DataLoader(
    ds,
    batch_size=<batch_size>,
    num_workers=1,
    pin_memory=False,
    persistent_workers=False,
)

Then try:

loader = DataLoader(
    ds,
    batch_size=<batch_size>,
    num_workers=4,
    prefetch_factor=1,
    pin_memory=False,
    persistent_workers=False,
)

Only after that, test:

pin_memory=True
persistent_workers=True

Step 7: Test distributed split explicitly

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("<dataset>", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=<rank>, world_size=<world_size>)

Then verify that each rank sees the expected data volume and does not duplicate work.

Step 8: Move cache to local disk

export HF_HOME=/local_nvme/hf
export HF_HUB_CACHE=/local_nvme/hf/hub
export HF_XET_CACHE=/local_nvme/hf/xet
export HF_DATASETS_CACHE=/local_nvme/hf/datasets

Then rerun the smallest reproduction.

Step 9: Decide whether to stream, stage, or materialize

Situation	Safer choice
quick exploration	direct streaming
single-node training	streaming may be fine if layout is good
multi-node production training	consider pre-staging or topology-aware shard layout
heavy preprocessing	materialize processed data
unstable object-store path	local staging/cache
strict restart/resume needs	deterministic/resumable streaming format

14. Current practical conclusion

I would summarize it this way:

streaming=True is a useful interface, but large-scale training needs more than lazy iteration.

It needs a data access plan that is aware of:

dataset layout
shard count
world size
node/rank/worker assignment
cache locality
DataLoader memory behavior
storage backend semantics
restart/resume behavior

So the safest practical recommendation is not:

“Just turn on streaming.”

It is closer to:

Design the dataset layout and cache path for the training topology, then use streaming as one component of that data plane.

For many production-scale jobs, the best workaround may be:

1. preprocess/materialize if transforms are heavy
2. publish balanced shards
3. choose shard counts compatible with expected world sizes
4. split explicitly by node/rank
5. keep cache on local NVMe/SSD
6. tune DataLoader concurrency conservatively
7. monitor skipped samples, cache hits, remote bytes, worker RSS

That does not mean direct streaming is bad.

It means direct remote streaming is only one point in the design space.

For small jobs, it is simple and convenient.

For large distributed training, topology-aware sharding and cache locality become first-class concerns.

15. Link collection

Hugging Face official docs / posts

Streaming datasets: 100x More Efficient
Datasets streaming docs
Dataset vs IterableDataset
Datasets main classes / IterableDataset reference
Datasets cache management
Hub local cache
huggingface_hub environment variables
huggingface_hub cache management
Upload files to the Hub / Xet cache guidance
Using Xet storage

Hugging Face forum / issues

Forum: Loading webdatasets across multiple nodes
Forum: How to use split_dataset_by_node and shuffle on IterableDataset
Forum: Keeping IterableDataset node-wise split fixed during DDP
Forum: A streaming dataset’s memory footprint continually grows
Forum: Speeding up Streaming of Large Datasets / FineWeb
Issue: IterableDataset distributed/DataLoader sharding performance traps
Issue: streaming datasets doesn’t work properly with multi-node
Issue: Problem in training iterable dataset
Issue: Memory leak when streaming
Issue: More easily support streaming local files
Issue: Make Dataset streaming queries retryable
Issue: HF_DATASETS_CACHE ignored?

PyTorch / DataLoader

PyTorch DataLoader docs
TorchData Stateful DataLoader docs
PyTorch issue: pin_memory / prefetch / persistent_workers memory behavior

Alternative / adjacent streaming systems

WebDataset multinode docs
WebDataset GitHub
WebDataset PyPI notes
MosaicML StreamingDataset GitHub
MosaicML StreamingDataset API docs
Databricks/MosaicML blog: StreamingDataset

Object storage / filesystem layer

fsspec docs
s3fs docs
s3fs issue: large file read, block size, cache type

Direct answer

1. Why this is not just “streaming is slow”

2. Important context: some older request-amplification issues have changed

3. Dataset layout matters more than it first appears

4. The num_shards vs world_size issue is probably central

5. split_dataset_by_node() helps, but it is not a universal fix

6. DataLoader settings can make a streaming issue look like a memory leak

7. Cache placement is part of the runtime contract

8. Hub/Xet is a separate layer from Datasets iteration

9. S3/fsspec/FUSE/object storage can be a hidden bottleneck

10. Direct remote streaming is not always the safest production path

11. Role-based responsibilities

12. What diagnostics would make this much easier?

13. Practical checklist

Step 1: Record versions and environment

Step 2: Inspect layout

Step 3: Check topology

Step 4: Test minimal iteration

Step 5: Add DataLoader with no multiprocessing

Step 6: Add workers gradually

Step 7: Test distributed split explicitly

Step 8: Move cache to local disk

Step 9: Decide whether to stream, stage, or materialize

14. Current practical conclusion

15. Link collection

Hugging Face official docs / posts

Hugging Face forum / issues

PyTorch / DataLoader

Alternative / adjacent streaming systems

Object storage / filesystem layer

Discussion in the ATmosphere

4. The `num_shards` vs `world_size` issue is probably central

5. `split_dataset_by_node()` helps, but it is not a universal fix