{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreib5i4wesizsabowb52suj4mkwyen2ftekcs3g7y7wba76jyyrknki",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnesyqec66t2"
},
"path": "/t/datasets-streaming-bottlenecks-storage-client-memory-gc-pressure-vs-sharding-cache-request-amplification/176485#post_1",
"publishedAt": "2026-06-03T08:02:14.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hey folks, this is my first time here, and it’s very nice to join the community.\n\nI’m validating a problem around large-scale HF Datasets streaming for training.\n\nI’m interested in cases where people stream datasets from the Hub, S3-compatible storage, Azure/GCS, or custom object storage instead of fully downloading the dataset first.\n\nThe specific question: when streaming at scale, have you seen the bottleneck come from the data access layer itself (memory growth, allocation pressure, GC, cache behavior, request amplification, or FUSE/client overhead) rather than just raw network bandwidth?\n\nQuestions:\n\n * Are you using datasets with streaming=True, WebDataset, Parquet shards, fsspec, s3fs, custom dataset scripts, or a local cache?\n\n * Does performance degrade with many workers, many nodes, or many small files?\n\n * Have you seen worker crashes, unstable throughput, high RSS, cache blowups, slow shard resolution, or excessive requests?\n\n * How do you shard data across nodes? Do all nodes stream everything and skip samples, or is each node assigned its own shards?\n\n * Have you traced the bottleneck to Python, object storage request count, metadata, network, FUSE, the storage SDK, or runtime GC?\n\n * Would a Rust-based streaming/cache layer be useful if it integrated with HF Datasets and offered stable memory usage, prefetching, shard-aware scheduling, and a local NVMe cache?\n\nI’m trying to understand whether this is a real pain point for training users, and what the minimum useful integration would need to support.\n\nI’d really appreciate your help!",
"title": "Datasets streaming bottlenecks: storage client memory/GC pressure vs sharding/cache/request amplification?"
}