Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihfacj4ggsv5xvihl66bntjbmgvy4n3gqkzoazi646pl66k4u4rbi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkdmcslnzxf2"
  },
  "path": "/t/load-dataset-creates-a-duplicate-in-cache/175561#post_1",
  "publishedAt": "2026-04-25T16:18:52.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I have around 500GB of sharded Parquet files stored locally in a single directory. When I load them using `load_dataset(\"parquet\", data_dir=\"/path_to_data\")`, everything works as expected. However, I’ve noticed that the dataset is effectively duplicated in the Hugging Face cache directory (`/path_to_cache/datasets/parquet`), which ends up consuming another ~500GB of storage.\n\nWhy does this duplication happen internally? Is there a way to load or reference the dataset without triggering this additional copy in the cache?\n\nI understand that if `load_dataset` is being used to download a dataset, it must save the data somewhere. In my case, though, the data is already saved in a different folder, so there shouldn’t be any need to create a second copy unless something else is happening that I don’t understand.",
  "title": "load_dataset() creates a duplicate in cache"
}