Custom Local Dataset Caching - Controlling Save & Loading
Hmm, I think it’s probably something like this:
The key distinction is that the generated cache directory/name is not usually the thing to load as if it were a normal dataset file.
There are several different “dataset-like things on disk” in datasets, and they are easy to mix up because more than one of them uses Arrow files internally.
The practical answer is:
- If the original JSON file is still available, run the same load_dataset("json", data_files=..., split=...) call again and let datasets reuse its prepared cache.
- If you want a stable local artifact for later scripts, explicitly call save_to_disk() and later reload it with load_from_disk().
- If you only have a raw .arrow file from the cache and want to inspect or recover it, use Dataset.from_file(path) as a lower-level escape hatch.
So I would not treat a generated cache folder/config-like name as a new dataset identifier. I would choose the loading method based on what kind of artifact I actually have.
Recommended pattern for a local JSON dataset
For the initial load from JSON:
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="/Volumes/XXXX/MyDataset.json",
    split="train",
)
If the same source file and the same loading options are used again, datasets should normally be able to reuse the prepared cache. That is the normal cache-reuse route.
If the goal is to make the prepared dataset reusable in a more explicit and stable way, save it yourself:
from datasets import load_dataset, load_from_disk

ds = load_dataset(
    "json",
    data_files="/Volumes/XXXX/MyDataset.json",
    split="train",
)
ds.save_to_disk("/Volumes/XXXX/my_dataset_saved")

# Later, possibly in another script:
ds = load_from_disk("/Volumes/XXXX/my_dataset_saved")
That is usually the cleanest workflow when you want to avoid reparsing/repreparing the source data and you want a directory that is meant to be reloaded later.
Why the cache path is confusing
A rough map helps:
Raw source files
(JSON / CSV / Parquet / local folders / Hub dataset repo)
        |
        | load_dataset(...)
        v
Prepared Dataset object
(Dataset / DatasetDict; Arrow-backed)
        |
        +-------------------------------+
        |                               |
        | automatic internal cache      | explicit local persistence
        |                               |
        v                               v
Datasets cache                    save_to_disk(...)
~/.cache/huggingface/datasets           |
Arrow files / indices                   v
implementation detail             Saved dataset directory
                                  Arrow + metadata/state
                                        |
                                        | load_from_disk(...)
                                        v
                                  Reloaded Dataset object

Other exits:
- single raw .arrow file   -> Dataset.from_file(...)
- share / long-term store  -> push_to_hub(...) / load_dataset("user/repo")
- generic export           -> to_csv / to_json / to_parquet / to_sql
The important part is that both the internal cache and a save_to_disk() dataset may contain Arrow files, but they are not the same contract.
- The internal cache is managed by datasets and is normally re-entered by repeating the same load_dataset() or transform call.
- A save_to_disk() directory is an explicit local artifact intended to be read by load_from_disk().
- A single .arrow file can be loaded with Dataset.from_file(path), but that is a lower-level/manual route.
The official docs describe the same split: load_dataset() loads from the Hub or local data files, runs a dataset builder, and processes/caches the dataset as typed Arrow tables. The cache docs also say that, by default, Datasets reuses an existing dataset if it exists. Separately, save_to_disk() / load_from_disk() are the explicit local save/reload APIs.
Which API goes with which artifact?
| If you have… | Usually use… | Do not confuse it with… |
|---|---|---|
| Original JSON/CSV/Parquet files | load_dataset("json"/"csv"/"parquet", data_files=...) | A generated cache folder name |
| A prepared dataset object you want to reuse later | dataset.save_to_disk(path) then load_from_disk(path) | The automatic cache |
| A generated internal cache directory | Usually repeat the same load_dataset() / transform call | A public saved dataset layout |
| One raw .arrow file | Dataset.from_file(path) | A full DatasetDict saved by save_to_disk() |
| A dataset you want to share or keep long-term | push_to_hub() / Parquet / load_dataset("user/repo") | Local-only Arrow cache directories |
| A dataset you want to use outside datasets | to_csv(), to_json(), to_parquet(), etc. | A Datasets-native saved directory |
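If it is unclear which row of the table applies to a path on disk, a rough heuristic like the one below can help. This is only a sketch using the standard library: guess_artifact_kind is a hypothetical helper, and the marker file names it checks (state.json for a saved Dataset, dataset_dict.json for a saved DatasetDict) are my assumptions about the current on-disk layout, not a documented contract.

```python
import tempfile
from pathlib import Path

# Marker file names below are assumptions about the current on-disk layout.
SOURCE_SUFFIXES = {".json", ".jsonl", ".csv", ".parquet"}


def guess_artifact_kind(path: Path) -> str:
    """Return a rough label for `path`; a heuristic, not an official check."""
    if path.is_file():
        if path.suffix == ".arrow":
            return "raw_arrow_file"  # candidate for Dataset.from_file(path)
        if path.suffix in SOURCE_SUFFIXES:
            return "source_file"  # candidate for load_dataset(..., data_files=path)
        return "unknown"
    if path.is_dir():
        if (path / "dataset_dict.json").is_file():
            return "saved_dataset_dict"  # candidate for load_from_disk(path)
        if (path / "state.json").is_file():
            return "saved_dataset"  # candidate for load_from_disk(path)
        if any(path.rglob("*.arrow")):
            return "maybe_cache_dir"  # usually: repeat the original load_dataset() call
        return "unknown"
    return "missing"


# Demo on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    saved = root / "my_dataset_saved"
    saved.mkdir()
    (saved / "state.json").write_text("{}")
    cache = root / "cache" / "json"
    cache.mkdir(parents=True)
    arrow_file = cache / "json-train.arrow"
    arrow_file.write_bytes(b"")
    source = root / "MyDataset.json"
    source.write_text("[]")

    kinds = {
        "saved": guess_artifact_kind(saved),
        "cache": guess_artifact_kind(root / "cache"),
        "arrow": guess_artifact_kind(arrow_file),
        "source": guess_artifact_kind(source),
        "missing": guess_artifact_kind(root / "nope"),
    }

print(kinds)
```

The helper only looks at file names and never opens the data, so treat its answer as a hint about which loading API to try first.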
About name=, data_files=, and generated cache names
The name= argument can affect the builder configuration/cache identity, but it does not turn the resulting cache folder into a new named dataset that should later be passed to data_files.
data_files is for actual source files or file patterns, for example:
load_dataset("json", data_files="/path/to/data.json")
or, if you truly have Arrow source files:
load_dataset("arrow", data_files="/path/to/data.arrow")
But if the thing you have is a generated cache folder name under something like ~/.cache/huggingface/datasets/..., that is usually not the right abstraction to pass back as data_files.
The generated cache path is closer to an implementation detail of the loading/preparation step than to a public dataset name.
If you really need to inspect a cache Arrow file
If there is a specific .arrow file and you want to inspect or recover it manually:
from datasets import Dataset
ds = Dataset.from_file("/full/path/to/file.arrow")
This can be useful as an escape hatch, especially when debugging or recovering a cache artifact.
However, I would not use it as the main persistence workflow unless there is a specific reason. A single Arrow file may not preserve the whole higher-level context you expected, especially if the original object was a DatasetDict, had indices, had multiple shards, or relied on other metadata.
For normal local reuse, prefer:
dataset.save_to_disk("/some/stable/path")
dataset = load_from_disk("/some/stable/path")
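For the manual inspection route, it can help to first see how many Arrow files a directory actually contains, since a lone file may be just one shard. The sketch below uses only the standard library; arrow_files_by_dir is a hypothetical helper, and the directory and file names in the demo are made up for illustration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def arrow_files_by_dir(root: Path) -> Dict[str, List[str]]:
    """Map each directory under `root` to the sorted .arrow file names it holds."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for arrow_path in root.rglob("*.arrow"):
        grouped[str(arrow_path.parent.relative_to(root))].append(arrow_path.name)
    return {folder: sorted(names) for folder, names in sorted(grouped.items())}


# Demo with empty placeholder files in a hypothetical layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    split_dir = root / "default" / "0.0.0"
    split_dir.mkdir(parents=True)
    for name in ("json-train-00001-of-00002.arrow", "json-train-00000-of-00002.arrow"):
        (split_dir / name).write_bytes(b"")
    listing = arrow_files_by_dir(root)

print(listing)
```

If a directory lists several files for the same split, loading only one of them with Dataset.from_file(path) would give you only part of the data.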
A simple robust local workflow
One practical pattern is:
from pathlib import Path

from datasets import load_dataset, load_from_disk

source_json = "/Volumes/XXXX/MyDataset.json"
saved_dir = "/Volumes/XXXX/my_dataset_saved"

if Path(saved_dir).exists():
    ds = load_from_disk(saved_dir)
else:
    ds = load_dataset(
        "json",
        data_files=source_json,
        split="train",
    )
    ds.save_to_disk(saved_dir)
This keeps the layers separate:
- load_dataset(...) loads/builds from the original source data.
- the internal cache is allowed to do its automatic job.
- save_to_disk(...) creates an explicit reusable dataset directory.
- load_from_disk(...) reloads that explicit directory.
A few practical edge cases
If the original local file moved
If the original JSON file moved, the exact same load_dataset("json", data_files=...) call may no longer describe the same source. In that case, either point data_files at the new real file path or use a previously saved dataset directory with load_from_disk().
If the dataset contains local media paths
For image/audio/video columns, be careful about whether the dataset stores actual media data or just local paths/URLs. If you move the dataset to another machine, path-based columns may break unless the paths still make sense there.
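A quick way to check this after moving a dataset is to test whether the stored paths still resolve. The sketch below assumes the column holds plain path strings (not decoded media); missing_local_paths is a hypothetical helper, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_local_paths(values: Iterable[str]) -> List[str]:
    """Return the entries that look like local paths but do not exist here."""
    missing_entries = []
    for value in values:
        if "://" in value:  # treat URLs as remote; they are not checked
            continue
        if not Path(value).exists():
            missing_entries.append(value)
    return missing_entries


# Demo: one existing file, one missing file, one URL.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "cat.png"
    present.write_bytes(b"")
    column = [
        str(present),
        str(Path(tmp) / "dog.png"),
        "https://example.com/bird.png",
    ]
    missing = missing_local_paths(column)
    expected_missing = [str(Path(tmp) / "dog.png")]

print(missing)
```

With a real dataset you would pass the values of a string column, for example missing_local_paths(ds["image_path"]), where "image_path" is a hypothetical column name.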
If the saved directory is on read-only storage
Loading may work from a read-only location, but later operations such as map(), filter(), shuffle(), or train_test_split() may need to write cache/index files. In that case, use a writable cache location or explicitly set the relevant cache/index file paths.
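One way to handle this is to pick a writable scratch directory up front and point the cache/index file arguments at it. The sketch below covers only the directory choice, using the standard library; pick_writable_dir and the fallback directory name are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def pick_writable_dir(preferred: str, fallback: Optional[str] = None) -> Path:
    """Return `preferred` if it exists and is writable, else a fallback directory."""
    candidate = Path(preferred)
    if candidate.is_dir() and os.access(candidate, os.W_OK):
        return candidate
    # Hypothetical default scratch location under the system temp directory.
    target = Path(fallback) if fallback else Path(tempfile.gettempdir()) / "hf_datasets_scratch"
    target.mkdir(parents=True, exist_ok=True)
    return target


# Demo: a writable directory is kept; a missing one falls back.
with tempfile.TemporaryDirectory() as tmp:
    writable = Path(tmp)
    chosen_existing = pick_writable_dir(str(writable))
    fallback_dir = writable / "scratch"
    chosen_fallback = pick_writable_dir(str(writable / "does_not_exist"), str(fallback_dir))
    fallback_created = fallback_dir.is_dir()

print(chosen_existing == writable, chosen_fallback == fallback_dir)

# With a loaded dataset `ds`, the chosen directory could then be used like:
#   scratch = pick_writable_dir(saved_dir)
#   ds = ds.map(fn, cache_file_name=str(scratch / "mapped.arrow"))
#   ds = ds.shuffle(seed=0, indices_cache_file_name=str(scratch / "shuffle_indices.arrow"))
```

The commented lines use the cache_file_name and indices_cache_file_name arguments of map() and shuffle(); the file names passed to them are placeholders.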
If this is for sharing or long-term storage
For local fast reloads, save_to_disk() is convenient. For sharing, remote reuse, and longer-term storage, a Hub dataset repo / Parquet route is often a better fit. The docs note that Arrow is fast for local disk reloads, but it is larger and less suited for upload/download/query than Parquet.
References
Official docs:
- load_dataset() loading methods
- Datasets cache management
- Saving and reloading processed datasets
- save_to_disk() / load_from_disk() reference
- Dataset.from_file() reference
Related forum notes:
- Same loading code should reload from cache
- Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file
- Directly loading a specific cache Arrow file with Dataset.from_file
- Example of confusion between cache folders and load_from_disk()-style saved datasets
So the practical recommendation is: do not manually chase the internal cache directory unless you are debugging or recovering data. For normal reuse, either repeat the original load_dataset("json", ...) call, or create a stable saved dataset directory with save_to_disk() and reload it with load_from_disk().