Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigfyuk7xq66gzbpv6dloey6koylk7fxsajpwzonhulaktp7mfukda",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo6ne2tdvj22"
  },
  "path": "/t/custom-local-dataset-caching-controlling-save-loading/176741#post_3",
  "publishedAt": "2026-06-13T15:49:18.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I think same like **John6666** suggested for you, the confusion comes from the fact that the Arrow files created in _`~/.cache/huggingface/datasets`_ are an internal cache used by _`load_dataset()`_\n\nNot a dataset artifact that you are supposed to reload manually with _`load_dataset(“arrow”, …)`._\n\nSo if your JSON file loads successfully and you see _`dataset_info.json`_ plus _`.arrow`_ files in the cache, that usually just means the library processed your raw JSON into its internal format. It does ****not**** mean that the right next step is to point _`load_dataset()`_ directly at that cache folder.\n\nIf you want to reuse the processed dataset later in another script, the recommended approach is to explicitly save it with _`save_to_disk()`_ and then reload it with _`load_from_disk()`_ :\n\n_```python_\n\n_from datasets import load_dataset_\n\n_dataset = load_dataset(“json”,\ndata_files=“/Volumes/XXXX/MyDataset.json”,\nsplit=“train”,)_\n\n_dataset.save_to_disk(“/path/to/my_saved_dataset”)_\n\n```And Hello guys btw",
  "title": "Custom Local Dataset Caching - Controlling Save & Loading"
}