
Datasets streaming bottlenecks: storage client memory/GC pressure vs sharding/cache/request amplification?

Hugging Face Forums [Unofficial] June 3, 2026
Hey folks, this is my first time here, and it's great to join the community.

I'm validating a problem around large-scale HF Datasets streaming for training. I'm interested in cases where people stream datasets from the Hub, S3-compatible storage, Azure/GCS, or custom object storage instead of fully downloading the dataset first.

The specific question: when streaming at scale, have you seen the bottleneck come from the data access layer itself (memory growth, allocation pressure, GC, cache behavior, request amplification, or FUSE/client overhead) rather than just raw network bandwidth?

Questions:

* Are you using datasets with streaming=True, WebDataset, Parquet shards, fsspec, s3fs, custom dataset scripts, or a local cache?
* Does performance degrade with many workers, many nodes, or many small files?
* Have you seen worker crashes, unstable throughput, high RSS, cache blowups, slow shard resolution, or excessive requests?
* How do you shard data across nodes? Do all nodes stream everything and skip samples, or is each node assigned its own shards?
* Have you traced the bottleneck to Python, object storage request count, metadata, network, FUSE, the storage SDK, or runtime GC?
* Would a Rust-based streaming/cache layer be useful if it integrated with HF Datasets and provided stable memory usage, prefetching, shard-aware scheduling, and a local NVMe cache?

I'm trying to understand whether this is a real pain point for training users, and what the minimum useful integration would need to support. I really appreciate your help!
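To make the sharding question concrete, here is a minimal pure-Python sketch of the two strategies I'm asking about. It does not use the `datasets` library; the shard names, the `read_shard` helper, and the fixed four samples per shard are all made up for illustration. As far as I understand from the docs, `datasets.distributed.split_dataset_by_node` assigns whole shards when the shard count divides evenly by the world size and otherwise falls back to skipping samples, but please correct me if that's wrong.

```python
from typing import Iterable, Iterator, List, Sequence


def assign_shards(shards: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Strategy A: each node reads only its own shards (round-robin by index)."""
    return [s for i, s in enumerate(shards) if i % world_size == rank]


def skip_samples(samples: Iterable, rank: int, world_size: int) -> Iterator:
    """Strategy B: every node reads the full stream and keeps 1 of every world_size samples."""
    for i, sample in enumerate(samples):
        if i % world_size == rank:
            yield sample


def read_shard(name: str, samples_per_shard: int = 4) -> Iterator[dict]:
    """Hypothetical stand-in for fetching and decoding one shard from object storage."""
    for j in range(samples_per_shard):
        yield {"shard": name, "idx": j}


def samples_read_and_kept(shards: Sequence[str], rank: int, world_size: int, strategy: str):
    """Return (samples fetched from storage, samples kept for training) for one node."""
    if strategy == "assign":
        kept = [s for name in assign_shards(shards, rank, world_size) for s in read_shard(name)]
        return len(kept), len(kept)
    # "skip": the node fetches every shard, then discards most samples.
    everything = [s for name in shards for s in read_shard(name)]
    kept = list(skip_samples(everything, rank, world_size))
    return len(everything), len(kept)


shards = [f"shard-{i:05d}.parquet" for i in range(8)]  # made-up shard names
world_size = 4

for strategy in ("assign", "skip"):
    fetched, kept = samples_read_and_kept(shards, rank=0, world_size=world_size, strategy=strategy)
    print(f"{strategy}: fetched={fetched} kept={kept} amplification={fetched / kept:.1f}x")
```

In this toy setup both strategies give each node the same number of training samples, but the skip strategy makes every node fetch the whole dataset, so the read amplification grows with the number of nodes. That is the kind of request amplification I'd like to hear about from real training runs.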
