{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigfyuk7xq66gzbpv6dloey6koylk7fxsajpwzonhulaktp7mfukda",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo6ne2tdvj22"
},
"path": "/t/custom-local-dataset-caching-controlling-save-loading/176741#post_3",
"publishedAt": "2026-06-13T15:49:18.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "As **John6666** suggested, I think the confusion comes from the fact that the Arrow files created in `~/.cache/huggingface/datasets` are an internal cache used by `load_dataset()`, not a dataset artifact that you are supposed to reload manually with `load_dataset(\"arrow\", …)`.\n\nSo if your JSON file loads successfully and you see `dataset_info.json` plus `.arrow` files in the cache, that usually just means the library processed your raw JSON into its internal format. It does **not** mean that the right next step is to point `load_dataset()` directly at that cache folder.\n\nIf you want to reuse the processed dataset later in another script, the recommended approach is to explicitly save it with `save_to_disk()` and then reload it with `load_from_disk()`:\n\n```python\nfrom datasets import load_dataset, load_from_disk\n\ndataset = load_dataset(\n    \"json\",\n    data_files=\"/Volumes/XXXX/MyDataset.json\",\n    split=\"train\",\n)\n\n# Save an explicit, reusable copy to a location you control\ndataset.save_to_disk(\"/path/to/my_saved_dataset\")\n\n# Later, in another script, reload that saved copy\ndataset = load_from_disk(\"/path/to/my_saved_dataset\")\n```\n\nAnd hello everyone, by the way.",
"title": "Custom Local Dataset Caching - Controlling Save & Loading"
}