
load_dataset() creates a duplicate in cache

Hugging Face Forums [Unofficial] April 25, 2026

I have around 500GB of sharded Parquet files stored locally in a single directory. When I load them using load_dataset("parquet", data_dir="/path_to_data"), everything works as expected. However, I’ve noticed that the dataset is effectively duplicated in the Hugging Face cache directory (/path_to_cache/datasets/parquet), which ends up consuming another ~500GB of storage.
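To put numbers on the duplication, a small standard-library sketch like the following can compare the two directories. The helper name dir_size_bytes is made up for illustration, and the paths are the placeholders from this post, not real locations.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted as extra storage.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Placeholder paths taken from the post; replace with the real ones.
    data_dir = "/path_to_data"
    cache_dir = "/path_to_cache/datasets/parquet"
    for label, path in (("data", data_dir), ("cache", cache_dir)):
        if os.path.isdir(path):
            print(f"{label}: {dir_size_bytes(path) / 1e9:.1f} GB")
```

If the cache figure grows to roughly the size of the source data after a load, that confirms the second copy is being written there.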

Why does this duplication happen internally? Is there a way to load or reference the dataset without triggering this additional copy in the cache?

I understand that if load_dataset is used to download a dataset, it has to save it somewhere. In my case, though, the data is already saved in a different folder, so there shouldn't be any need to create a second copy unless something else is happening that I don't understand.
