Custom Local Dataset Caching - Controlling Save & Loading

Hugging Face Forums [Unofficial] June 12, 2026
Hmm, I think it’s probably something like this:


The key point is that the generated cache directory (or its generated name) is usually not something you load as if it were a normal dataset file.

There are several different “dataset-like things on disk” in datasets, and they are easy to mix up because more than one of them uses Arrow files internally.

The practical answer is:

  1. If the original JSON file is still available, run the same load_dataset("json", data_files=..., split=...) call again and let datasets reuse its prepared cache.
  2. If you want a stable local artifact for later scripts, explicitly call save_to_disk() and later reload it with load_from_disk().
  3. If you only have a raw .arrow file from the cache and want to inspect or recover it, use Dataset.from_file(path) as a lower-level escape hatch.

So I would not treat a generated cache folder/config-like name as a new dataset identifier. I would choose the loading method based on what kind of artifact I actually have.

Recommended pattern for a local JSON dataset

For the initial load from JSON:

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="/Volumes/XXXX/MyDataset.json",
    split="train",
)

If the same source file and the same loading options are used again, datasets should normally be able to reuse the prepared cache. That is the normal cache-reuse route.

If the goal is to make the prepared dataset reusable in a more explicit and stable way, save it yourself:

from datasets import load_dataset, load_from_disk

ds = load_dataset(
    "json",
    data_files="/Volumes/XXXX/MyDataset.json",
    split="train",
)

ds.save_to_disk("/Volumes/XXXX/my_dataset_saved")

# Later, possibly in another script:
ds = load_from_disk("/Volumes/XXXX/my_dataset_saved")

That is usually the cleanest workflow when you want to avoid reparsing/repreparing the source data and you want a directory that is meant to be reloaded later.

Why the cache path is confusing

A rough map helps:

Raw source files
(JSON / CSV / Parquet / local folders / Hub dataset repo)
        |
        |  load_dataset(...)
        v
Prepared Dataset object
(Dataset / DatasetDict; Arrow-backed)
        |
        +-----------------------------+
        |                             |
        | automatic internal cache    | explicit local persistence
        |                             |
        v                             v
Datasets cache                  save_to_disk(...)
~/.cache/huggingface/datasets        |
Arrow files / indices                v
implementation detail          Saved dataset directory
                                Arrow + metadata/state
                                      |
                                      | load_from_disk(...)
                                      v
                                Reloaded Dataset object

Other exits:
  - single raw .arrow file  -> Dataset.from_file(...)
  - share / long-term store -> push_to_hub(...) / load_dataset("user/repo")
  - generic export          -> to_csv / to_json / to_parquet / to_sql

The important part is that both the internal cache and a save_to_disk() dataset may contain Arrow files, but they are not the same contract.

  • The internal cache is managed by datasets and is normally re-entered by repeating the same load_dataset() or transform call.
  • A save_to_disk() directory is an explicit local artifact intended to be read by load_from_disk().
  • A single .arrow file can be loaded with Dataset.from_file(path), but that is a lower-level/manual route.

The official docs describe the same split: load_dataset() loads from the Hub or local data files, runs a dataset builder, and processes/caches the dataset as typed Arrow tables. The cache docs also say that, by default, Datasets reuses an existing dataset if it exists. Separately, save_to_disk() / load_from_disk() are the explicit local save/reload APIs.

Which API goes with which artifact?

| If you have… | Usually use… | Do not confuse it with… |
| --- | --- | --- |
| Original JSON/CSV/Parquet files | load_dataset("json"/"csv"/"parquet", data_files=...) | A generated cache folder name |
| A prepared dataset object you want to reuse later | dataset.save_to_disk(path) then load_from_disk(path) | The automatic cache |
| A generated internal cache directory | Usually repeat the same load_dataset() / transform call | A public saved dataset layout |
| One raw .arrow file | Dataset.from_file(path) | A full DatasetDict saved by save_to_disk() |
| A dataset you want to share or keep long-term | push_to_hub() / Parquet / load_dataset("user/repo") | Local-only Arrow cache directories |
| A dataset you want to use outside datasets | to_csv(), to_json(), to_parquet(), etc. | A Datasets-native saved directory |
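
The mapping above can be condensed into a small path-inspection helper. This is only a sketch and not part of datasets: suggest_loader and SOURCE_BUILDERS are hypothetical names, and the marker files it checks (dataset_dict.json for a saved DatasetDict, state.json for a saved Dataset) reflect what save_to_disk() writes in the versions I have seen, which is an assumption about the on-disk layout rather than a documented contract.

```python
from pathlib import Path

# Hypothetical mapping from source-file suffix to a load_dataset() builder name.
SOURCE_BUILDERS = {".json": "json", ".jsonl": "json", ".csv": "csv", ".parquet": "parquet"}


def suggest_loader(path):
    """Return a hint naming the datasets API that matches the artifact at `path`."""
    p = Path(path)
    if p.is_dir():
        # Assumed save_to_disk() markers; a cache directory has neither of them.
        if (p / "dataset_dict.json").is_file():
            return "load_from_disk (saved DatasetDict)"
        if (p / "state.json").is_file():
            return "load_from_disk (saved Dataset)"
        return "repeat the original load_dataset() call"
    suffix = p.suffix.lower()
    if suffix == ".arrow":
        return "Dataset.from_file"
    if suffix in SOURCE_BUILDERS:
        return f'load_dataset("{SOURCE_BUILDERS[suffix]}", data_files=...)'
    return "unknown artifact"
```

The helper only returns a hint string and never loads anything, so it is safe to point at a cache directory while debugging.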

About name=, data_files=, and generated cache names

The name= argument can affect the builder configuration/cache identity, but it does not turn the resulting cache folder into a new named dataset that should later be passed to data_files.

data_files is for actual source files or file patterns, for example:

load_dataset("json", data_files="/path/to/data.json")

or, if you truly have Arrow source files:

load_dataset("arrow", data_files="/path/to/data.arrow")

But if the thing you have is a generated cache folder name under something like ~/.cache/huggingface/datasets/..., that is usually not the right abstraction to pass back as data_files.

The generated cache path is closer to an implementation detail of the loading/preparation step than to a public dataset name.

If you really need to inspect a cache Arrow file

If there is a specific .arrow file and you want to inspect or recover it manually:

from datasets import Dataset

ds = Dataset.from_file("/full/path/to/file.arrow")

This can be useful as an escape hatch, especially when debugging or recovering a cache artifact.

However, I would not use it as the main persistence workflow unless there is a specific reason. A single Arrow file may not preserve the whole higher-level context you expected, especially if the original object was a DatasetDict, had indices, had multiple shards, or relied on other metadata.
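
If the Dataset object is still in memory, there is no need to guess which cache file backs it. A minimal sketch, assuming the object exposes a cache_files property that is a list of dicts with a "filename" key (as datasets.Dataset does); backing_arrow_files is a hypothetical helper name:

```python
def backing_arrow_files(ds):
    """Return the paths of the Arrow files that back a dataset.

    Assumes `ds.cache_files` is a list of {"filename": ...} dicts.
    A purely in-memory dataset has an empty list, so this returns [].
    """
    return [entry["filename"] for entry in ds.cache_files]
```

Each returned path is the kind of file Dataset.from_file() expects, which makes this a more reliable way to find it than browsing the cache directory by hand.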

For normal local reuse, prefer:

from datasets import load_from_disk

# `dataset` is an already-loaded Dataset or DatasetDict
dataset.save_to_disk("/some/stable/path")
dataset = load_from_disk("/some/stable/path")

A simple robust local workflow

One practical pattern is:

from pathlib import Path
from datasets import load_dataset, load_from_disk

source_json = "/Volumes/XXXX/MyDataset.json"
saved_dir = "/Volumes/XXXX/my_dataset_saved"

if Path(saved_dir).exists():
    ds = load_from_disk(saved_dir)
else:
    ds = load_dataset(
        "json",
        data_files=source_json,
        split="train",
    )
    ds.save_to_disk(saved_dir)

This keeps the layers separate:

  • load_dataset(...) loads/builds from the original source data.
  • the internal cache is allowed to do its automatic job.
  • save_to_disk(...) creates an explicit reusable dataset directory.
  • load_from_disk(...) reloads that explicit directory.
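
One weakness of checking only whether the directory exists is that an interrupted save_to_disk() can leave a directory that is present but not loadable. A slightly more defensive sketch is below. The names looks_saved and load_or_build are hypothetical, and treating state.json / dataset_dict.json as "the save finished" markers is a heuristic based on the layout I have seen, not a documented guarantee.

```python
from pathlib import Path


def looks_saved(saved_dir):
    """Heuristic: True if `saved_dir` contains a save_to_disk() marker file."""
    d = Path(saved_dir)
    return (d / "state.json").is_file() or (d / "dataset_dict.json").is_file()


def load_or_build(source_json, saved_dir):
    """Reload the saved directory if it looks complete, else rebuild it from JSON."""
    # Imported here so that looks_saved() is usable without datasets installed.
    from datasets import load_dataset, load_from_disk

    if looks_saved(saved_dir):
        return load_from_disk(saved_dir)
    ds = load_dataset("json", data_files=source_json, split="train")
    ds.save_to_disk(saved_dir)
    return ds
```

If a half-written directory is detected this way, it is probably safest to delete it by hand before rebuilding, so that stale shard files are not left next to the new ones.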

A few practical edge cases

If the original local file moved

If the original JSON file moved, the exact same load_dataset("json", data_files=...) call may no longer describe the same source. In that case, either point data_files at the new real file path or use a previously saved dataset directory with load_from_disk().

If the dataset contains local media paths

For image/audio/video columns, be careful about whether the dataset stores actual media data or just local paths/URLs. If you move the dataset to another machine, path-based columns may break unless the paths still make sense there.
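
A quick way to check this after moving a dataset is to test whether the stored paths still resolve. This sketch assumes the column holds plain string paths (not decoded media objects); find_missing_paths is a hypothetical helper, and it skips anything that looks like a URL:

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the local paths in `paths` that do not exist on this machine.

    None values and entries that look like URLs are skipped, since they
    are not local files.
    """
    missing = []
    for p in paths:
        if p is None or "://" in p:
            continue
        if not Path(p).exists():
            missing.append(p)
    return missing
```

With a dataset in hand, this could be called as find_missing_paths(ds["image_path"]), where image_path is a made-up column name.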

If the saved directory is on read-only storage

Loading may work from a read-only location, but later operations such as map(), filter(), shuffle(), or train_test_split() may need to write cache/index files. In that case, use a writable cache location or explicitly set the relevant cache/index file paths.
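
For map(), one way to do that is the cache_file_name parameter, which tells Dataset.map() where to write its result. The wrapper below is a minimal sketch; map_with_writable_cache and its name argument are hypothetical, and it assumes ds behaves like a datasets.Dataset:

```python
from pathlib import Path


def map_with_writable_cache(ds, fn, cache_dir, name="mapped"):
    """Run ds.map(fn) and write the resulting Arrow cache under `cache_dir`.

    `cache_dir` should be a writable location; it is created if missing.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return ds.map(fn, cache_file_name=str(cache_dir / f"{name}.arrow"))
```

shuffle() has a similar indices_cache_file_name parameter. Setting the HF_DATASETS_CACHE environment variable to a writable directory before importing datasets is the broader alternative.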

If this is for sharing or long-term storage

For local fast reloads, save_to_disk() is convenient. For sharing, remote reuse, and longer-term storage, a Hub dataset repo / Parquet route is often a better fit. The docs note that Arrow is fast for local disk reloads, but it is larger and less suited for upload/download/query than Parquet.

References

Official docs:

  • load_dataset() loading methods
  • Datasets cache management
  • Saving and reloading processed datasets
  • save_to_disk() / load_from_disk() reference
  • Dataset.from_file() reference

Related forum notes:

  • Same loading code should reload from cache
  • Saving processed data with save_to_disk, reloading with load_from_disk, and direct Arrow loading with Dataset.from_file
  • Directly loading a specific cache Arrow file with Dataset.from_file
  • Example of confusion between cache folders and load_from_disk()-style saved datasets

So the practical recommendation is: do not manually chase the internal cache directory unless you are debugging or recovering data. For normal reuse, either repeat the original load_dataset("json", ...) call, or create a stable saved dataset directory with save_to_disk() and reload it with load_from_disk().
