Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigtkuh4wtppmcjlef4xqy54cxp2kz6jk6fyiydoccy4e6ks43hipa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkeahbyxgke2"
  },
  "path": "/t/load-dataset-creates-a-duplicate-in-cache/175561#post_2",
  "publishedAt": "2026-04-25T22:48:26.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "@lhoestq",
    "Hugging Face",
    "Apache Arrow",
    "Hugging Face Forums"
  ],
  "textContent": "I think that was one of those cases where there really wasn’t a clean, library-level workaround other than using `IterableDataset`…\n\nIt’s a common headache with HF libraries: they rely on caching mechanisms, so while they offer performance benefits, it’s hard to just turn them off…\n\njust in case, @lhoestq\n\n* * *\n\n## Why `load_dataset(\"parquet\", data_dir=...)` creates another ~500 GB copy\n\nYour local Parquet files are **not being downloaded again**. They are being used as **source files** to build a normal Hugging Face `Dataset`.\n\nThe key distinction is:\n\n\n    /path_to_data/*.parquet\n        = your original storage/source format\n\n    /path_to_cache/datasets/parquet/...\n        = Hugging Face Datasets' prepared Arrow cache\n\n\nA regular non-streaming Hugging Face `Dataset` is Arrow-backed. The loading path prepares the dataset as Arrow files in the Datasets cache, while `streaming=True` takes a different path and avoids that full preparation step. Hugging Face’s loading docs describe the builder as able to “download and prepare the dataset as Arrow files in the cache” or “get a streaming dataset without downloading or caching anything.” (Hugging Face)\n\nSo this:\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\"parquet\", data_dir=\"/path_to_data\")\n\n\nroughly means:\n\n\n    local Parquet shards\n        ↓ read/decode\n    Arrow cache files\n        ↓ memory-mapped / accessed by Dataset\n    regular map-style Dataset\n\n\nIt does **not** mean:\n\n\n    regular Dataset directly points at compressed Parquet files in-place\n\n\nThat is why your disk usage doubles.\n\n* * *\n\n## Why Parquet is not used directly as the regular `Dataset` backend\n\nParquet and Arrow solve different problems.\n\nFormat | Main role | Why it matters here\n---|---|---\n**Parquet** | Compressed storage / interchange / data lake format | Efficient on disk, but must be decoded before use\n**Arrow** | Runtime table / memory-oriented 
columnar format | Better suited to fast indexing, slicing, mapping, and memory mapping\n\nApache Arrow’s docs explain the lower-level reason: Parquet data must be decoded from Parquet format and compression, so it cannot be directly mapped from disk like an Arrow IPC-style file; `memory_map=True` may help some systems but does not remove the decoding/resident-memory issue. (Apache Arrow)\n\nApache Arrow’s FAQ makes the same conceptual split: Parquet is optimized for storage efficiency with compression and encoding, while Arrow is laid out for direct and efficient computation. (Apache Arrow)\n\nThat is the core cause of the “duplicate”:\n\n\n    Parquet is your compact source format.\n    Arrow is the normal Hugging Face Dataset runtime format.\n\n\nThe Datasets cache is therefore not just a download cache. Hugging Face’s cache docs distinguish the Hub cache, which stores downloaded Hub files, from the Datasets cache, which stores datasets converted into Arrow format. (Hugging Face)\n\n* * *\n\n## What will not solve it\n\n### `cache_dir=None`\n\nThis does not mean “no cache.” It only leaves the cache behavior at the default location or avoids overriding it.\n\nUse `cache_dir` to **move** the Arrow cache:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        cache_dir=\"/mnt/big_disk/hf_datasets_cache\",\n    )\n\n\nBut this still creates the Arrow cache.\n\n* * *\n\n### `datasets.disable_caching()`\n\nThis is a common trap. It does **not** stop the initial non-streaming `load_dataset()` preparation cache. A related Hugging Face forum discussion confirms the distinction: disabling caching affects transform-style cache behavior such as `.map()` / `.filter()` intermediates, but `load_dataset()` still writes the original prepared dataset cache. 
(Hugging Face Forums)\n\nSo this is not enough:\n\n\n    import datasets\n    from datasets import load_dataset\n\n    datasets.disable_caching()\n\n    ds = load_dataset(\"parquet\", data_dir=\"/path_to_data\")\n\n\n* * *\n\n### `keep_in_memory=True`\n\nFor ~500 GB, this usually makes the problem worse. It shifts pressure from disk to RAM. Unless you have unusually large memory and a narrow one-off workload, avoid it.\n\n* * *\n\n## Real solution categories\n\nThere is no single flag that gives all of these at once:\n\n\n    regular Dataset\n    + random access\n    + no Arrow cache\n    + direct compressed-Parquet backing\n\n\nYou need to pick one of the strategies below.\n\n* * *\n\n# Strategy 1 — Avoid the Arrow copy with streaming\n\nThis is the cleanest answer if disk space is the main constraint.\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n    )\n\n\nThis returns an `IterableDataset`, not a regular `Dataset`.\n\nUse this when you are:\n\n  * training over examples,\n  * tokenizing,\n  * scanning,\n  * filtering,\n  * computing statistics,\n  * converting to another format,\n  * doing mostly sequential reads.\n\n\n\nHugging Face’s streaming docs explicitly describe streaming local files without conversion, including cases where Arrow conversion would take too long or exceed available disk. (Hugging Face)\n\nThe tradeoff is API behavior. Hugging Face’s map-style-vs-iterable guide says `IterableDataset` is ideal for very large datasets, including hundreds of GB, while regular `Dataset` is better when you need normal indexed behavior. 
(Hugging Face)\n\n### Add column selection immediately\n\nFor Parquet, this is especially important:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n        columns=[\"text\", \"label\"],\n    )\n\n\nParquet is columnar, and the streaming docs note that `columns` and `filters` can be used to stream only selected columns and apply filtering. (Hugging Face)\n\n### Add filters when useful\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n        columns=[\"text\", \"label\", \"quality_score\"],\n        filters=[(\"quality_score\", \">=\", 0.8)],\n    )\n\n\nThis works best if your Parquet files have useful row-group statistics or partitioning. If every row group contains mixed values, filtering may still require substantial scanning.\n\n### Shuffle carefully\n\n\n    ds = ds.shuffle(seed=42, buffer_size=100_000)\n\n\nThis is a buffer shuffle, not a perfect global shuffle. For training it is often acceptable, but if your shards are sorted by source, time, label, project, or language, also randomize shard order or re-shard the data.\n\n* * *\n\n# Strategy 2 — Avoid `streaming=True`, but reduce what gets materialized\n\nIf you need a normal map-style `Dataset`, then assume Arrow cache is unavoidable. 
Your job is to make it smaller.\n\n## Load only required columns\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        columns=[\"id\", \"text\", \"label\"],\n        cache_dir=\"/mnt/big_disk/hf_datasets_cache\",\n    )\n\n\nThis still writes Arrow cache, but only for selected columns.\n\nAvoid this pattern:\n\n\n    ds = load_dataset(\"parquet\", data_dir=\"/path_to_data\", split=\"train\")\n    ds = ds.remove_columns([\"huge_unused_column\"])\n\n\nBy the time `remove_columns()` runs, the huge column may already have been materialized into Arrow cache.\n\n* * *\n\n## Use filters during load, not after full load\n\nPrefer:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        columns=[\"id\", \"text\", \"label\", \"quality_score\"],\n        filters=[(\"quality_score\", \">=\", 0.8)],\n        cache_dir=\"/mnt/big_disk/hf_datasets_cache\",\n    )\n\n\nAvoid:\n\n\n    ds = load_dataset(\"parquet\", data_dir=\"/path_to_data\", split=\"train\")\n    ds = ds.filter(lambda x: x[\"quality_score\"] >= 0.8)\n\n\nThe second version may first materialize the full dataset, then filter it.\n\n* * *\n\n## Pre-reduce the Parquet before Hugging Face Datasets\n\nThis is often the best non-streaming workaround.\n\nUse DuckDB or PyArrow to create a smaller Parquet dataset first:\n\n\n    import duckdb\n\n    duckdb.sql(\"\"\"\n    COPY (\n        SELECT\n            id,\n            text,\n            label\n        FROM read_parquet('/path_to_data/*.parquet')\n        WHERE text IS NOT NULL\n          AND quality_score >= 0.8\n    ) TO '/path_to_reduced_data/reduced.parquet'\n    (FORMAT PARQUET)\n    \"\"\")\n\n\nThen load the reduced dataset normally:\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_reduced_data\",\n        
split=\"train\",\n        cache_dir=\"/mnt/big_disk/hf_datasets_cache\",\n    )\n\n\nDuckDB’s Parquet docs describe projection and filter pushdown for Parquet scans, which is exactly what you want before handing the data to a map-style `Dataset`. (DuckDB)\n\nThis changes the storage picture from:\n\n\n    500 GB raw Parquet\n    + ~500 GB Arrow cache\n\n\nto something more like:\n\n\n    500 GB raw Parquet\n    + smaller reduced Parquet\n    + smaller Arrow cache\n\n\n* * *\n\n# Strategy 3 — Accept Arrow, but manage it deliberately\n\nIf you need full `Dataset` behavior, accept the Arrow copy and make it intentional.\n\n## Put the cache on a large disk\n\n\n    export HF_DATASETS_CACHE=\"/mnt/big_nvme/hf_datasets_cache\"\n\n\nor:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        cache_dir=\"/mnt/big_nvme/hf_datasets_cache\",\n    )\n\n\nThis does not save storage. It prevents accidental cache growth under your home directory or system disk.\n\n* * *\n\n## Convert once, then save a named prepared dataset\n\n\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        columns=[\"id\", \"text\", \"label\"],\n        cache_dir=\"/mnt/scratch/hf_build_cache\",\n    )\n\n    ds.save_to_disk(\n        \"/mnt/datasets/my_dataset_arrow_v1\",\n        max_shard_size=\"2GB\",\n    )\n\n\nFuture runs should use:\n\n\n    from datasets import load_from_disk\n\n    ds = load_from_disk(\"/mnt/datasets/my_dataset_arrow_v1\")\n\n\nThis is cleaner than repeatedly rebuilding from raw Parquet. The Datasets docs cover saving and reloading prepared datasets via `save_to_disk()` / `load_from_disk()`. (Hugging Face)\n\nBe careful: during conversion you may temporarily have three large things:\n\n\n    1. raw Parquet source\n    2. temporary HF Arrow cache\n    3. 
saved Arrow dataset artifact\n\n\nAfter validating the saved artifact, remove the temporary build cache if it is no longer needed.\n\n* * *\n\n# Strategy 4 — Stream raw data into final processed shards\n\nIf your real goal is tokenization or preprocessing, do not materialize the raw dataset as Arrow first.\n\nBetter pipeline:\n\n\n    raw Parquet\n        -> streaming read\n        -> tokenize/process\n        -> write final processed Parquet or Arrow shards\n\n\nExample skeleton:\n\n\n    from datasets import load_dataset\n    import pyarrow as pa\n    import pyarrow.parquet as pq\n    from pathlib import Path\n\n    raw = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n        columns=[\"id\", \"text\"],\n    )\n\n    out_dir = Path(\"/path_to_tokenized_parquet\")\n    out_dir.mkdir(parents=True, exist_ok=True)\n\n    buffer = []\n    shard_id = 0\n    rows_per_shard = 100_000\n\n    def tokenize_text(text):\n        # Replace with your tokenizer.\n        return {\n            \"input_ids\": [1, 2, 3],\n            \"attention_mask\": [1, 1, 1],\n        }\n\n    for row in raw:\n        encoded = tokenize_text(row[\"text\"])\n        buffer.append({\n            \"id\": row[\"id\"],\n            \"input_ids\": encoded[\"input_ids\"],\n            \"attention_mask\": encoded[\"attention_mask\"],\n        })\n\n        if len(buffer) >= rows_per_shard:\n            table = pa.Table.from_pylist(buffer)\n            pq.write_table(table, out_dir / f\"part-{shard_id:05d}.parquet\")\n            buffer.clear()\n            shard_id += 1\n\n    if buffer:\n        table = pa.Table.from_pylist(buffer)\n        pq.write_table(table, out_dir / f\"part-{shard_id:05d}.parquet\")\n\n\nThis avoids:\n\n\n    raw Parquet\n        -> full raw Arrow cache\n        -> tokenized Arrow cache\n        -> final saved copy\n\n\nand instead creates only the processed artifact you actually need.\n\n* * *\n\n# 
How I would decide in your case\n\n## If disk is tight\n\nUse:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n        columns=[\"needed_col_1\", \"needed_col_2\"],\n    )\n\n\nThis is the closest to “reference my local Parquet without another 500 GB copy.”\n\n* * *\n\n## If you need a regular `Dataset`\n\nUse:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        columns=[\"needed_col_1\", \"needed_col_2\"],\n        filters=[(\"quality_score\", \">=\", 0.8)],  # if applicable\n        cache_dir=\"/mnt/big_disk/hf_datasets_cache\",\n    )\n\n\nThis still creates Arrow cache, but reduces and relocates it.\n\n* * *\n\n## If you repeatedly use the same dataset\n\nConvert once:\n\n\n    ds.save_to_disk(\"/mnt/datasets/my_dataset_arrow_v1\")\n\n\nThen reuse:\n\n\n    ds = load_from_disk(\"/mnt/datasets/my_dataset_arrow_v1\")\n\n\nDo not rebuild the Arrow cache from raw Parquet for every experiment.\n\n* * *\n\n## If your task is preprocessing/tokenization\n\nPrefer:\n\n\n    stream raw Parquet -> write final processed shards\n\n\nrather than:\n\n\n    raw Parquet -> full raw Arrow cache -> processed cache -> final output\n\n\n* * *\n\n# Quick diagnostic checks\n\n## Check what backing files were created\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n    )\n\n    print(ds.cache_files[:5])\n\n\nIf you see `.arrow` files under the Datasets cache, that is the normal prepared dataset backend.\n\n* * *\n\n## Test streaming behavior safely\n\n\n    from itertools import islice\n    from datasets import load_dataset\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_data\",\n        split=\"train\",\n        streaming=True,\n    )\n\n    print(type(ds))\n\n    for row in islice(ds, 5):\n        print(row.keys())\n\n\nYou should 
get an iterable dataset and no full ~500 GB Arrow cache.\n\n* * *\n\n## Estimate non-streaming expansion on a small subset\n\nDo not test the full 500 GB first. Use 1–5% of shards:\n\n\n    ds = load_dataset(\n        \"parquet\",\n        data_dir=\"/path_to_small_subset\",\n        split=\"train\",\n        columns=[\"id\", \"text\", \"label\"],\n        cache_dir=\"/tmp/hf_cache_test\",\n    )\n\n    print(ds.cache_files[:3])\n\n\nThen compare:\n\n\n    small Parquet size\n    vs\n    small Arrow cache size\n\n\nUse that ratio to estimate the full run.\n\n* * *\n\n# Bottom line\n\nThe extra ~500 GB appears because non-streaming `load_dataset(\"parquet\", data_dir=...)` builds a regular Arrow-backed Hugging Face `Dataset`. Your local Parquet files are the **source** , not the final runtime backing store.\n\nYour realistic choices are:\n\n  1. **Avoid the copy**\nUse `streaming=True` or `IterableDataset.from_parquet()`.\n\n  2. **Reduce the copy**\nUse `columns`, `filters`, or pre-reduce with DuckDB/PyArrow.\n\n  3. **Accept the copy deliberately**\nMove `HF_DATASETS_CACHE`, save a named prepared dataset with `save_to_disk()`, and clean temporary build caches.\n\n\n\n\n## Compact summary\n\n  * The cache is not a duplicate download; it is the Arrow runtime representation.\n  * Parquet is compressed source storage; Arrow is the regular `Dataset` backend.\n  * `cache_dir` moves the cache; it does not remove it.\n  * `disable_caching()` does not stop initial `load_dataset()` preparation.\n  * `streaming=True` is the cleanest no-extra-copy path.\n  * If avoiding streaming, select columns and filter before Arrow materialization.\n  * For repeated use, convert once, `save_to_disk()`, then `load_from_disk()`.\n  * If preprocessing, stream raw Parquet into final processed shards instead of building a full raw Arrow cache first.\n\n",
  "title": "Load_dataset() creates a duplicate in cache"
}