Datasets streaming bottlenecks: storage client memory/GC pressure vs sharding/cache/request amplification?
Hugging Face Forums [Unofficial]
June 3, 2026
Hey folks, this is my first time here, and it’s very nice to join the community.
I’m validating a problem around large-scale HF Datasets streaming for training.
I’m interested in cases where people stream datasets from the Hub, S3-compatible storage, Azure/GCS, or custom object storage instead of fully downloading the dataset first.
The specific question: when streaming at scale, have you seen the bottleneck come from the data access layer itself (memory growth, allocation pressure, GC, cache behavior, request amplification, or FUSE/client overhead) rather than just raw network bandwidth?
Questions:
* Are you using datasets with streaming=True, WebDataset, Parquet shards, fsspec, s3fs, custom dataset scripts, or local cache?
* Does performance degrade with many workers, many nodes, or many small files?
* Have you seen worker crashes, unstable throughput, high RSS, cache blowups, slow shard resolution, or excessive requests?
* How do you shard data across nodes? Do all nodes stream everything and skip samples, or does each node get assigned shards?
* Have you traced the bottleneck to Python, object storage request count, metadata, network, FUSE, storage SDK, or runtime GC?
* Would a Rust-based streaming/cache layer be useful if it integrated with HF Datasets and exposed stable memory, prefetching, shard-aware scheduling, and local NVMe cache?
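To make the sharding question concrete, here is a minimal sketch of the two strategies I mean. It is plain Python with no HF Datasets calls; the function names and shard file names are hypothetical and only for illustration. As far as I understand, `datasets.distributed.split_dataset_by_node` chooses between similar behaviors depending on whether the shard count divides evenly by the world size, but please correct me if that is wrong.

```python
from typing import Iterable, Iterator, List, Sequence


def assign_shards(shards: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Strategy A: each node opens only its own shards (round-robin by index)."""
    return [s for i, s in enumerate(shards) if i % world_size == rank]


def skip_samples(samples: Iterable, rank: int, world_size: int) -> Iterator:
    """Strategy B: every node reads the full stream and keeps 1 of every world_size samples."""
    for i, sample in enumerate(samples):
        if i % world_size == rank:
            yield sample


if __name__ == "__main__":
    # Hypothetical shard names, not taken from any real dataset.
    shards = [f"shard-{i:05d}.parquet" for i in range(8)]
    world_size = 4
    for rank in range(world_size):
        mine = assign_shards(shards, rank, world_size)
        # Strategy A: this node fetches len(mine) shards.
        # Strategy B: this node would fetch all len(shards) shards and discard most samples.
        print(f"rank {rank}: assigned {mine}; skip strategy would read {len(shards)} shards")
```

With strategy B, total shard reads across the cluster grow as `world_size * len(shards)`, which is the kind of request amplification I am asking about.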
I’m trying to understand whether this is a real pain point for training users, and what the minimum useful integration would need to support.
I really appreciate your help!