
Custom Local Dataset Caching - Controlling Save & Loading

Hugging Face Forums [Unofficial] June 13, 2026

I think, as John6666 already suggested, the confusion comes from the fact that the Arrow files created in ~/.cache/huggingface/datasets are an internal cache used by load_dataset(), not a dataset artifact that you are supposed to reload manually with load_dataset("arrow", …).

So if your JSON file loads successfully and you see dataset_info.json plus .arrow files in the cache, that usually just means the library processed your raw JSON into its internal format. It does not mean that the right next step is to point load_dataset() directly at that cache folder.

If you want to reuse the processed dataset later in another script, the recommended approach is to explicitly save it with save_to_disk() and then reload it with load_from_disk():

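```python
from datasets import load_dataset

# Parse the raw JSON once; this is the step that fills the internal cache.
dataset = load_dataset("json", data_files="/Volumes/XXXX/MyDataset.json", split="train")

# Write an explicit, reusable copy to a folder you control.
dataset.save_to_disk("/path/to/my_saved_dataset")
```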

