Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifqbqykwmfah3net7mxglrvbnvx64ha224wryzq5jf7eqvorwgih4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo4xnenxnbs2"
  },
  "path": "/t/custom-local-dataset-caching-controlling-save-loading/176741#post_2",
  "publishedAt": "2026-06-12T23:05:24.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "load_dataset() loading methods",
    "Datasets cache management",
    "Saving and reloading processed datasets",
    "save_to_disk() / load_from_disk() reference",
    "Dataset.from_file() reference",
    "Same loading code should reload from cache",
    "Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file",
    "Directly loading a specific cache Arrow file with Dataset.from_file",
    "Example of confusion between cache folders and load_from_disk()-style saved datasets"
  ],
  "textContent": "Hmm, I think it’s probably something like this:\n\n* * *\n\nThe key distinction is that the generated cache directory/name is not usually the thing to load as if it were a normal dataset file.\n\nThere are several different “dataset-like things on disk” in `datasets`, and they are easy to mix up because more than one of them uses Arrow files internally.\n\nThe practical answer is:\n\n  1. If the original JSON file is still available, run the same `load_dataset(\"json\", data_files=..., split=...)` call again and let `datasets` reuse its prepared cache.\n  2. If you want a stable local artifact for later scripts, explicitly call `save_to_disk()` and later reload it with `load_from_disk()`.\n  3. If you only have a raw `.arrow` file from the cache and want to inspect or recover it, use `Dataset.from_file(path)` as a lower-level escape hatch.\n\n\n\nSo I would not treat a generated cache folder/config-like name as a new dataset identifier. I would choose the loading method based on what kind of artifact I actually have.\n\n## Recommended pattern for a local JSON dataset\n\nFor the initial load from JSON:\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"json\",\n        data_files=\"/Volumes/XXXX/MyDataset.json\",\n        split=\"train\",\n    )\n\n\nIf the same source file and the same loading options are used again, `datasets` should normally be able to reuse the prepared cache. 
That is the normal cache-reuse route.\n\nIf the goal is to make the prepared dataset reusable in a more explicit and stable way, save it yourself:\n\n\n    from datasets import load_dataset, load_from_disk\n\n    ds = load_dataset(\n        \"json\",\n        data_files=\"/Volumes/XXXX/MyDataset.json\",\n        split=\"train\",\n    )\n\n    ds.save_to_disk(\"/Volumes/XXXX/my_dataset_saved\")\n\n    # Later, possibly in another script:\n    ds = load_from_disk(\"/Volumes/XXXX/my_dataset_saved\")\n\n\nThat is usually the cleanest workflow when you want to avoid reparsing/repreparing the source data and you want a directory that is meant to be reloaded later.\n\n## Why the cache path is confusing\n\nA rough map helps:\n\n\n    Raw source files\n    (JSON / CSV / Parquet / local folders / Hub dataset repo)\n            |\n            |  load_dataset(...)\n            v\n    Prepared Dataset object\n    (Dataset / DatasetDict; Arrow-backed)\n            |\n            +-----------------------------+\n            |                             |\n            | automatic internal cache    | explicit local persistence\n            |                             |\n            v                             v\n    Datasets cache                  save_to_disk(...)\n    ~/.cache/huggingface/datasets        |\n    Arrow files / indices                v\n    implementation detail          Saved dataset directory\n                                    Arrow + metadata/state\n                                          |\n                                          | load_from_disk(...)\n                                          v\n                                    Reloaded Dataset object\n\n    Other exits:\n      - single raw .arrow file  -> Dataset.from_file(...)\n      - share / long-term store -> push_to_hub(...) 
/ load_dataset(\"user/repo\")\n      - generic export          -> to_csv / to_json / to_parquet / to_sql\n\n\nThe important part is that both the internal cache and a `save_to_disk()` dataset may contain Arrow files, but they are not the same contract.\n\n  * The **internal cache** is managed by `datasets` and is normally re-entered by repeating the same `load_dataset()` or transform call.\n  * A **`save_to_disk()` directory** is an explicit local artifact intended to be read by `load_from_disk()`.\n  * A **single `.arrow` file** can be loaded with `Dataset.from_file(path)`, but that is a lower-level/manual route.\n\n\n\nThe official docs describe the same split: `load_dataset()` loads from the Hub or local data files, runs a dataset builder, and processes/caches the dataset as typed Arrow tables. The cache docs also say that, by default, Datasets reuses an existing dataset if it exists. Separately, `save_to_disk()` / `load_from_disk()` are the explicit local save/reload APIs.\n\n## Which API goes with which artifact?\n\nIf you have… | Usually use… | Do not confuse it with…\n---|---|---\nOriginal JSON/CSV/Parquet files | `load_dataset(\"json\"/\"csv\"/\"parquet\", data_files=...)` | A generated cache folder name\nA prepared dataset object you want to reuse later | `dataset.save_to_disk(path)` then `load_from_disk(path)` | The automatic cache\nA generated internal cache directory | Usually repeat the same `load_dataset()` / transform call | A public saved dataset layout\nOne raw `.arrow` file | `Dataset.from_file(path)` | A full `DatasetDict` saved by `save_to_disk()`\nA dataset you want to share or keep long-term | `push_to_hub()` / Parquet / `load_dataset(\"user/repo\")` | Local-only Arrow cache directories\nA dataset you want to use outside `datasets` | `to_csv()`, `to_json()`, `to_parquet()`, etc. 
| A Datasets-native saved directory\n\n## About `name=`, `data_files=`, and generated cache names\n\nThe `name=` argument can affect the builder configuration/cache identity, but it does not turn the resulting cache folder into a new named dataset that should later be passed to `data_files`.\n\n`data_files` is for actual source files or file patterns, for example:\n\n\n    load_dataset(\"json\", data_files=\"/path/to/data.json\")\n\n\nor, if you truly have Arrow source files:\n\n\n    load_dataset(\"arrow\", data_files=\"/path/to/data.arrow\")\n\n\nBut if the thing you have is a generated cache folder name under something like `~/.cache/huggingface/datasets/...`, that is usually not the right abstraction to pass back as `data_files`.\n\nThe generated cache path is closer to an implementation detail of the loading/preparation step than to a public dataset name.\n\n## If you really need to inspect a cache Arrow file\n\nIf there is a specific `.arrow` file and you want to inspect or recover it manually:\n\n\n    from datasets import Dataset\n\n    ds = Dataset.from_file(\"/full/path/to/file.arrow\")\n\n\nThis can be useful as an escape hatch, especially when debugging or recovering a cache artifact.\n\nHowever, I would not use it as the main persistence workflow unless there is a specific reason. 
A single Arrow file may not preserve the whole higher-level context you expected, especially if the original object was a `DatasetDict`, had indices, had multiple shards, or relied on other metadata.\n\nFor normal local reuse, prefer:\n\n\n    dataset.save_to_disk(\"/some/stable/path\")\n    dataset = load_from_disk(\"/some/stable/path\")\n\n\n## A simple robust local workflow\n\nOne practical pattern is:\n\n\n    from pathlib import Path\n    from datasets import load_dataset, load_from_disk\n\n    source_json = \"/Volumes/XXXX/MyDataset.json\"\n    saved_dir = \"/Volumes/XXXX/my_dataset_saved\"\n\n    if Path(saved_dir).exists():\n        ds = load_from_disk(saved_dir)\n    else:\n        ds = load_dataset(\n            \"json\",\n            data_files=source_json,\n            split=\"train\",\n        )\n        ds.save_to_disk(saved_dir)\n\n\nThis keeps the layers separate:\n\n  * `load_dataset(...)` loads/builds from the original source data.\n  * the internal cache is allowed to do its automatic job.\n  * `save_to_disk(...)` creates an explicit reusable dataset directory.\n  * `load_from_disk(...)` reloads that explicit directory.\n\n\n\n## A few practical edge cases\n\n### If the original local file moved\n\nIf the original JSON file moved, the exact same `load_dataset(\"json\", data_files=...)` call may no longer describe the same source. In that case, either point `data_files` at the new real file path or use a previously saved dataset directory with `load_from_disk()`.\n\n### If the dataset contains local media paths\n\nFor image/audio/video columns, be careful about whether the dataset stores actual media data or just local paths/URLs. 
If you move the dataset to another machine, path-based columns may break unless the paths still make sense there.\n\n### If the saved directory is on read-only storage\n\nLoading may work from a read-only location, but later operations such as `map()`, `filter()`, `shuffle()`, or `train_test_split()` may need to write cache/index files. In that case, use a writable cache location or explicitly set the relevant cache/index file paths.\n\n### If this is for sharing or long-term storage\n\nFor local fast reloads, `save_to_disk()` is convenient. For sharing, remote reuse, and longer-term storage, a Hub dataset repo / Parquet route is often a better fit. The docs note that Arrow is fast for local disk reloads, but it is larger and less suited for upload/download/query than Parquet.\n\n## References\n\nOfficial docs:\n\n  * load_dataset() loading methods\n  * Datasets cache management\n  * Saving and reloading processed datasets\n  * save_to_disk() / load_from_disk() reference\n  * Dataset.from_file() reference\n\n\n\nRelated forum notes:\n\n  * Same loading code should reload from cache\n  * Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file\n  * Directly loading a specific cache Arrow file with Dataset.from_file\n  * Example of confusion between cache folders and load_from_disk()-style saved datasets\n\n\n\nSo the practical recommendation is: do not manually chase the internal cache directory unless you are debugging or recovering data. For normal reuse, either repeat the original `load_dataset(\"json\", ...)` call, or create a stable saved dataset directory with `save_to_disk()` and reload it with `load_from_disk()`.",
  "title": "Custom Local Dataset Caching - Controlling Save & Loading"
}