Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic5udbqh2fmpo4k7lclnkabq2j7hhcgg3mdvp4mkw3q4e67e7xoee",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgmgjeug7oj2"
  },
  "path": "/t/huggingface-datasets-card-not-work-correctly/174072#post_2",
  "publishedAt": "2026-03-09T07:20:51.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "Hugging Face Forums"
  ],
  "textContent": "For now, a quick diagnosis:\n\n* * *\n\n## What I think is happening in your case\n\nYour new dataset is **not failing in the core way that would actually break usability**. Right now, its page shows a working **Dataset Viewer** , an **“Expand in Data Studio”** link, preview rows with the columns `audio`, `text`, `source`, `sample_rate`, and `speaker`, and a visible **“Downloads last month”** count of **10**. That means the important backend pieces are already working: the Hub can read the repo, understand the schema, and power the viewer/Data Studio layer. (Hugging Face)\n\nWhat is missing on the new page is the **top header metadata surface** : no `Modalities`, no `Formats`, no `Libraries`, and no separate top tab labeled `Data Studio`. Your older dataset page does show all of those: `Modalities: Audio, Text`, `Formats: parquet`, `Libraries: Datasets, Dask, Polars`, plus a top-level `Data Studio` tab. So the two repos are being rendered differently at the page-header level even though both are dataset-viewer-compatible. (Hugging Face)\n\n## Why I do **not** think the dataset itself is broken\n\nBoth repositories are built around the same basic storage pattern: a `data/` directory containing Parquet shards named `train-...parquet`. Your new repo has six shards (`train-00000-of-00006.parquet` through `train-00005-of-00006.parquet`), and the older repo has forty-one shards. Since both use Parquet shards under `data/`, the missing badges on the new page are **not well explained by using the wrong file format or the wrong top-level layout**. (Hugging Face)\n\nHugging Face’s own docs say that a dataset with a supported structure and supported file formats automatically gets a **Dataset Viewer** , and that the viewer backend auto-converts Hub datasets to **Parquet** for exploration. Your new page already passes that bar. In other words, the fundamental data ingestion pipeline appears healthy. (Hugging Face)\n\n## The real split: viewer/backend vs. 
metadata/header\n\nHugging Face separates two things:\n\n  1. **Dataset Viewer / Data Studio / schema preview** , which come from the viewer backend and recognizable dataset structure.\n  2. **Dataset card metadata and header badges** , which come from the `README.md` card and especially its YAML metadata block at the top. (Hugging Face)\n\n\n\nThat distinction matters a lot here. Your new repo already has the backend-driven features that matter most to users: preview, rows, columns, Parquet, and Data Studio access through the viewer. The missing pieces are mostly the **card/header decorations and tags**. That points much more strongly to a **metadata inference / indexing / rendering inconsistency** than to a damaged dataset. (Hugging Face)\n\n## The most important clue in your README\n\nYour new `README.md` already contains structured metadata such as `dataset_info`, feature types, split info, `configs`, `task_categories`, `language`, and a `license` field. But the `license` field in the YAML is currently just **`cc`** , while the human-readable body below says the dataset is released under **CC BY 4.0**. On the page, the header also shows only the generic **`cc`** badge. Hugging Face’s license list distinguishes the broad family identifier `cc` from the specific identifier `cc-by-4.0`. So your repo is currently telling the Hub “generic Creative Commons” in metadata while telling readers “CC BY 4.0” in prose. That mismatch is a concrete sign that the metadata layer is not as specific as it should be. (Hugging Face)\n\nBy contrast, your old dataset’s page shows rich badges even though its README preview is effectively empty apart from auto-generated metadata. That tells me the older page likely benefited from **auto-inference or prior indexing behavior** that the new page has not reproduced in the same way. In plain terms: the older repo got lucky or got indexed more favorably; the newer one did not. 
(Hugging Face)\n\n## About the missing Modalities and Libraries\n\nHugging Face documents that:\n\n  * modality is **auto-detected** from the files, but you can **force it** by adding tags such as `audio` and `text` to the YAML metadata;\n  * the dataset page automatically shows compatible libraries, but you can also **manually associate libraries** by tagging the dataset with values such as `datasets`, `dask`, `pandas`, or `webdataset`. (Hugging Face)\n\n\n\nThat is almost exactly your situation. The page clearly understands your columns well enough to show `audio` and `text` in the viewer, but it is **not surfacing those as top-level header badges**. Since HF explicitly documents manual tags as the fallback, the safest conclusion is: **automatic inference did not fully populate the page header for this repo**. (Hugging Face)\n\n## About Data Studio\n\nThis part is easy to misread. Hugging Face says **Data Studio is enabled by default for all public datasets**. Your new page does not show a separate top `Data Studio` tab the way the older page does, but it **does** show `Expand in Data Studio` inside the viewer area. So for your repo, the correct reading is not “Data Studio is unavailable”; it is “Data Studio is available, but the page is presenting it differently.” (Hugging Face)\n\n## About download counts\n\nThis part is partly normal, partly confusing.\n\nHugging Face’s current rule is that **all files downloaded by the same user/IP within a 5-minute window in one repository count as a single dataset download**. That is done specifically so one user downloading many files or splits does not inflate the counter. So if you uploaded six Parquet shards, a person fetching several of them in one session still may count as only **one** dataset download. (Hugging Face)\n\nThat means a low number does **not** automatically mean tracking is broken. And in your repo’s current state, the count is not missing anyway: the page shows **10 downloads last month**. 
So the strongest answer is: **download tracking appears to be active right now**. (Hugging Face)\n\nAt the same time, public forum posts do show that Hugging Face has had periods where dataset download statistics appeared stuck or delayed for some users. There are reports from June 2024, June–July 2025, and March 2026 describing counters not updating for days. Those reports do not prove your repo has a bug, but they do show that **stats glitches on the Hub are a real thing** , so it is reasonable that this made you suspicious. (Hugging Face Forums)\n\n## My actual diagnosis\n\nFor **your** repository, I would describe it this way:\n\n  * **Core dataset publishing:** working. Viewer works, Parquet is recognized, schema is recognized, Data Studio is reachable, and downloads are being counted. (Hugging Face)\n  * **Header metadata / discoverability badges:** inconsistent. The page is not surfacing modalities, format, libraries, or the dedicated Data Studio tab the way the older repo does. (Hugging Face)\n  * **Most likely cause:** incomplete or inconsistent **Hub-side metadata inference / indexing** , made worse by your metadata being partially generic (`license: cc`) and not explicitly tagging modality/library. (Hugging Face)\n\n\n\nSo: **not a broken dataset, but an incomplete card/header rendering state**.\n\n## What I would do\n\n### 1. 
Make the metadata explicit instead of relying on auto-detection\n\nAdd or revise the YAML block at the top of `README.md` so it explicitly tells the Hub what you want shown.\n\nUse something close to this:\n\n\n    ---\n    pretty_name: Synthetic Turkish TTS Data\n    language:\n      - tr\n    license: cc-by-4.0\n    task_categories:\n      - text-to-speech\n      - automatic-speech-recognition\n    tags:\n      - audio\n      - text\n      - datasets\n      - dask\n      - polars\n      - pandas\n\n    dataset_info:\n      features:\n        - name: audio\n          dtype: audio\n        - name: text\n          dtype: string\n        - name: source\n          dtype: string\n        - name: sample_rate\n          dtype: int64\n        - name: speaker\n          dtype: string\n      splits:\n        - name: train\n          num_bytes: 2942869578\n          num_examples: 13000\n      download_size: 3303296950\n      dataset_size: 2942869578\n\n    configs:\n      - config_name: default\n        data_files:\n          - split: train\n            path: data/train-*\n    ---\n\n\nWhy this helps:\n\n  * `license: cc-by-4.0` matches what your card text already says, and it uses the **specific** official Hugging Face identifier instead of the generic `cc`. (Hugging Face)\n  * `tags: [audio, text]` is the documented way to force the **modality**. (Hugging Face)\n  * `tags: [datasets, dask, polars, pandas]` is the documented way to force **library associations** when the page does not show them automatically. (Hugging Face)\n\n\n\n### 2. Commit the README change and let the page re-index\n\nBecause the header badges are metadata-driven, a README metadata commit is the right trigger for the Hub to recalculate what should be shown. Your current repo history already shows recent README updates, so another small metadata cleanup commit is reasonable. (Hugging Face)\n\n### 3. 
Judge success by the right signals\n\nAfter the commit, the signals that matter most are:\n\n  * header shows `Modalities: Audio, Text`;\n  * header shows one or more `Libraries`;\n  * license badge becomes `cc-by-4.0` rather than generic `cc`;\n  * viewer still works;\n  * `Expand in Data Studio` still works. (Hugging Face)\n\n\n\n### 4. If the header still does not update, treat it as a Hub bug, not your fault\n\nThere are enough public examples of viewer/stat inconsistencies on the Hub that I would not keep rewriting your dataset structure forever if the explicit tags do not fix it. Once your metadata is explicit and the backend is still healthy, a persistent missing-badges problem is best treated as **a Hub-side UI/indexing issue**. Public reports about row counts, Data Studio visibility, and stats glitches show that this class of issue does happen. (Hugging Face Forums)\n\n## Bottom line\n\nFor your case, my answer is:\n\n  * **Is this normal?** Partly. It is not unusual for Hugging Face’s metadata/display layer to be less consistent than the actual dataset backend, and the docs explicitly allow manual YAML tags because auto-detection is not always enough. (Hugging Face)\n  * **Is your dataset broken?** No. The important backend features are already working. (Hugging Face)\n  * **Are downloads tracked?** Yes, at least currently. The page shows `10`, and HF’s counting rules also make counts look lower than many authors expect. (Hugging Face)\n  * **What should you do?** Make the README YAML fully explicit: set `license: cc-by-4.0`, add `audio` and `text` tags, add the library tags you want, then let the page re-index. If the header still stays incomplete after that, treat it as a Hugging Face page/indexing issue rather than a dataset-format issue on your side. (Hugging Face)\n\n",
  "title": "Huggingface datasets card not work correctly"
}