{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifqbqykwmfah3net7mxglrvbnvx64ha224wryzq5jf7eqvorwgih4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mo4xnenxnbs2"
},
"path": "/t/custom-local-dataset-caching-controlling-save-loading/176741#post_2",
"publishedAt": "2026-06-12T23:05:24.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"load_dataset() loading methods",
"Datasets cache management",
"Saving and reloading processed datasets",
"save_to_disk() / load_from_disk() reference",
"Dataset.from_file() reference",
"Same loading code should reload from cache",
"Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file",
"Directly loading a specific cache Arrow file with Dataset.from_file",
"Example of confusion between cache folders and load_from_disk()-style saved datasets"
],
"textContent": "Hmm, I think it’s probably something like this:\n\n* * *\n\nThe key distinction is that the generated cache directory/name is not usually the thing to load as if it were a normal dataset file.\n\nThere are several different “dataset-like things on disk” in `datasets`, and they are easy to mix up because more than one of them uses Arrow files internally.\n\nThe practical answer is:\n\n 1. If the original JSON file is still available, run the same `load_dataset(\"json\", data_files=..., split=...)` call again and let `datasets` reuse its prepared cache.\n 2. If you want a stable local artifact for later scripts, explicitly call `save_to_disk()` and later reload it with `load_from_disk()`.\n 3. If you only have a raw `.arrow` file from the cache and want to inspect or recover it, use `Dataset.from_file(path)` as a lower-level escape hatch.\n\n\n\nSo I would not treat a generated cache folder/config-like name as a new dataset identifier. I would choose the loading method based on what kind of artifact I actually have.\n\n## Recommended pattern for a local JSON dataset\n\nFor the initial load from JSON:\n\n\n from datasets import load_dataset\n\n ds = load_dataset(\n \"json\",\n data_files=\"/Volumes/XXXX/MyDataset.json\",\n split=\"train\",\n )\n\n\nIf the same source file and the same loading options are used again, `datasets` should normally be able to reuse the prepared cache. 
That is the normal cache-reuse route.\n\nIf the goal is to make the prepared dataset reusable in a more explicit and stable way, save it yourself:\n\n\n from datasets import load_dataset, load_from_disk\n\n ds = load_dataset(\n     \"json\",\n     data_files=\"/Volumes/XXXX/MyDataset.json\",\n     split=\"train\",\n )\n\n ds.save_to_disk(\"/Volumes/XXXX/my_dataset_saved\")\n\n # Later, possibly in another script:\n ds = load_from_disk(\"/Volumes/XXXX/my_dataset_saved\")\n\n\nThat is usually the cleanest workflow when you want to avoid reparsing/repreparing the source data and you want a directory that is meant to be reloaded later.\n\n## Why the cache path is confusing\n\nA rough map helps:\n\n\n Raw source files\n (JSON / CSV / Parquet / local folders / Hub dataset repo)\n         |\n         | load_dataset(...)\n         v\n Prepared Dataset object\n (Dataset / DatasetDict; Arrow-backed)\n         |\n         +-----------------------------+\n         |                             |\n         | automatic internal cache    | explicit local persistence\n         |                             |\n         v                             v\n Datasets cache                save_to_disk(...)\n ~/.cache/huggingface/datasets         |\n Arrow files / indices                 v\n implementation detail         Saved dataset directory\n                               Arrow + metadata/state\n                                       |\n                                       | load_from_disk(...)\n                                       v\n                               Reloaded Dataset object\n\n Other exits:\n - single raw .arrow file   -> Dataset.from_file(...)\n - share / long-term store  -> push_to_hub(...) 
/ load_dataset(\"user/repo\")\n - generic export           -> to_csv / to_json / to_parquet / to_sql\n\n\nThe important part is that both the internal cache and a `save_to_disk()` dataset may contain Arrow files, but they are not the same contract.\n\n * The **internal cache** is managed by `datasets` and is normally re-entered by repeating the same `load_dataset()` or transform call.\n * A **`save_to_disk()` directory** is an explicit local artifact intended to be read by `load_from_disk()`.\n * A **single `.arrow` file** can be loaded with `Dataset.from_file(path)`, but that is a lower-level/manual route.\n\n\n\nThe official docs describe the same split: `load_dataset()` loads from the Hub or local data files, runs a dataset builder, and processes/caches the dataset as typed Arrow tables. The cache docs also say that, by default, Datasets reuses an existing dataset if it exists. Separately, `save_to_disk()` / `load_from_disk()` are the explicit local save/reload APIs.\n\n## Which API goes with which artifact?\n\nIf you have… | Usually use… | Do not confuse it with…\n---|---|---\nOriginal JSON/CSV/Parquet files | `load_dataset(\"json\"/\"csv\"/\"parquet\", data_files=...)` | A generated cache folder name\nA prepared dataset object you want to reuse later | `dataset.save_to_disk(path)` then `load_from_disk(path)` | The automatic cache\nA generated internal cache directory | Usually repeat the same `load_dataset()` / transform call | A public saved dataset layout\nOne raw `.arrow` file | `Dataset.from_file(path)` | A full `DatasetDict` saved by `save_to_disk()`\nA dataset you want to share or keep long-term | `push_to_hub()` / Parquet / `load_dataset(\"user/repo\")` | Local-only Arrow cache directories\nA dataset you want to use outside `datasets` | `to_csv()`, `to_json()`, `to_parquet()`, etc. 
| A Datasets-native saved directory\n\n## About `name=`, `data_files=`, and generated cache names\n\nThe `name=` argument can affect the builder configuration/cache identity, but it does not turn the resulting cache folder into a new named dataset that should later be passed to `data_files`.\n\n`data_files` is for actual source files or file patterns, for example:\n\n\n load_dataset(\"json\", data_files=\"/path/to/data.json\")\n\n\nor, if you truly have Arrow source files:\n\n\n load_dataset(\"arrow\", data_files=\"/path/to/data.arrow\")\n\n\nBut if the thing you have is a generated cache folder name under something like `~/.cache/huggingface/datasets/...`, that is usually not the right abstraction to pass back as `data_files`.\n\nThe generated cache path is closer to an implementation detail of the loading/preparation step than to a public dataset name.\n\n## If you really need to inspect a cache Arrow file\n\nIf there is a specific `.arrow` file and you want to inspect or recover it manually:\n\n\n from datasets import Dataset\n\n ds = Dataset.from_file(\"/full/path/to/file.arrow\")\n\n\nThis can be useful as an escape hatch, especially when debugging or recovering a cache artifact.\n\nHowever, I would not use it as the main persistence workflow unless there is a specific reason. 
A single Arrow file may not preserve the whole higher-level context you expected, especially if the original object was a `DatasetDict`, had indices, had multiple shards, or relied on other metadata.\n\nFor normal local reuse, prefer:\n\n\n dataset.save_to_disk(\"/some/stable/path\")\n dataset = load_from_disk(\"/some/stable/path\")\n\n\n## A simple robust local workflow\n\nOne practical pattern is:\n\n\n from pathlib import Path\n from datasets import load_dataset, load_from_disk\n\n source_json = \"/Volumes/XXXX/MyDataset.json\"\n saved_dir = \"/Volumes/XXXX/my_dataset_saved\"\n\n # Reuse the explicitly saved copy if it already exists.\n if Path(saved_dir).exists():\n     ds = load_from_disk(saved_dir)\n else:\n     # Otherwise build from the source JSON and save a stable copy.\n     ds = load_dataset(\n         \"json\",\n         data_files=source_json,\n         split=\"train\",\n     )\n     ds.save_to_disk(saved_dir)\n\n\nThis keeps the layers separate:\n\n * `load_dataset(...)` loads/builds from the original source data.\n * The internal cache is allowed to do its automatic job.\n * `save_to_disk(...)` creates an explicit reusable dataset directory.\n * `load_from_disk(...)` reloads that explicit directory.\n\n\n\n## A few practical edge cases\n\n### If the original local file moved\n\nIf the original JSON file moved, the exact same `load_dataset(\"json\", data_files=...)` call may no longer describe the same source. In that case, either point `data_files` at the new real file path or use a previously saved dataset directory with `load_from_disk()`.\n\n### If the dataset contains local media paths\n\nFor image/audio/video columns, be careful about whether the dataset stores actual media data or just local paths/URLs. If you move the dataset to another machine, path-based columns may break unless the paths still make sense there.\n\n### If the saved directory is on read-only storage\n\nLoading may work from a read-only location, but later operations such as `map()`, `filter()`, `shuffle()`, or `train_test_split()` may need to write cache/index files. 
In that case, use a writable cache location or explicitly set the relevant cache/index file paths.\n\n### If this is for sharing or long-term storage\n\nFor local fast reloads, `save_to_disk()` is convenient. For sharing, remote reuse, and longer-term storage, a Hub dataset repo / Parquet route is often a better fit. The docs note that Arrow is fast for local disk reloads, but it is larger and less suited for upload/download/query than Parquet.\n\n## References\n\nOfficial docs:\n\n * load_dataset() loading methods\n * Datasets cache management\n * Saving and reloading processed datasets\n * save_to_disk() / load_from_disk() reference\n * Dataset.from_file() reference\n\n\n\nRelated forum notes:\n\n * Same loading code should reload from cache\n * Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file\n * Directly loading a specific cache Arrow file with Dataset.from_file\n * Example of confusion between cache folders and load_from_disk()-style saved datasets\n\n\n\nSo the practical recommendation is: do not manually chase the internal cache directory unless you are debugging or recovering data. For normal reuse, either repeat the original `load_dataset(\"json\", ...)` call, or create a stable saved dataset directory with `save_to_disk()` and reload it with `load_from_disk()`.",
"title": "Custom Local Dataset Caching - Controlling Save & Loading"
}