Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig3kqffcs3q3l25y5expzvcvuk2aikuxmkydsysjmq6rwndmll7lm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjnybbheikj2"
  },
  "path": "/t/dataset-viewer-broke-after-repo-rename/175327#post_2",
  "publishedAt": "2026-04-17T01:45:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "@lhoestq",
    "huggingface.co",
    "github.com"
  ],
  "textContent": "Renaming the dataset itself may simply be a trigger in this case, not the cause:\njust in case, @lhoestq\n\n* * *\n\nThe most likely explanation is that the **rename triggered a fresh viewer-side Parquet rebuild** , and that rebuild failed inside Hugging Face’s backend conversion path. The missing `refs/convert/parquet` ref is consistent with that failure, because Hugging Face documents that the dataset viewer’s Parquet copy is published on `refs/convert/parquet`, and the Hub API treats `converts` as internal preprocessed refs separate from ordinary branches and tags. (huggingface.co, huggingface.co)\n\n## 1. What the dataset viewer actually does\n\nThe full dataset viewer is not just rendering your raw files directly from the repo. Hugging Face exposes a dedicated dataset-viewer API with endpoints such as `/is-valid`, `/first-rows`, `/rows`, `/search`, `/filter`, `/parquet`, and `/size`. For the full viewer, Hugging Face builds and serves a Parquet-backed representation of the dataset; the Parquet docs explicitly say those files are published on `refs/convert/parquet`. (huggingface.co, huggingface.co)\n\nThat architectural detail matters because it means there are really **two layers** involved:\n\n  1. your visible dataset repo and its contents, and\n  2. an internal, generated viewer layer that mirrors the dataset in Parquet for browsing and querying. (huggingface.co, huggingface.co)\n\n\n\nSo when the UI says:\n\n> “The full dataset viewer is not available. Only showing a preview of the rows.”\n\nthat does **not** automatically mean your dataset files are broken. Hugging Face’s validity docs explicitly document that `preview` and `viewer` are separate capabilities, so a dataset can remain previewable while the full viewer is unavailable. (huggingface.co)\n\n## 2. What your traceback says, technically\n\nThe important part of your traceback is not the outer `DatasetGenerationError`. 
The important part is the inner crash:\n\n\n    original_shard_lengths[original_shard_id] += len(table)\n    IndexError: list index out of range\n\n\nThat line comes from the dataset-building logic used during the viewer’s Parquet-generation job. In other words, the failure is happening while Hugging Face is preparing the split for Parquet output, not while the browser is simply reading an already-existing table. (github.com)\n\nThis is especially significant because Hugging Face already has a public upstream fix for that exact failure class. There is a `huggingface/datasets` PR titled **“Fix index out of bound error with `original_shard_lengths`”**, and the related `datasets` 4.6.0 release notes include **“Support empty shard in `from_generator`.”** That is the strongest single clue in your entire report. It means the crash pattern itself is already known to Hugging Face and is not just something unique to your repo rename. (github.com, github.com)\n\n## 3. Why the rename likely triggered it\n\nA repo rename can be completely harmless for the raw data and still break the viewer layer.\n\nHugging Face documents repo moves through repository tooling, but the viewer’s Parquet mirror is documented separately as generated state living in `refs/convert/parquet`, and the Hub API classifies these as internal `converts` refs. That means a rename is not just a Git rename from the viewer backend’s perspective; it can require the backend to **re-resolve, regenerate, or republish** the derived Parquet artifacts under the new repo identity. (huggingface.co, huggingface.co, huggingface.co)\n\nThat gives a very plausible sequence for your case:\n\n  1. the dataset worked before because the old hidden Parquet mirror already existed,\n  2. the repo was renamed,\n  3. the viewer backend had to rebuild or reattach the Parquet mirror,\n  4. the rebuild hit the `original_shard_lengths` bug,\n  5. the Parquet publish step never completed,\n  6. `refs/convert/parquet` is now missing,\n  7. 
the page falls back to preview-only mode.\n\n\n\nThat sequence is an inference, but it is strongly supported by the official architecture docs and the public bug history. (huggingface.co, github.com, github.com)\n\n## 4. Why the missing `refs/convert/parquet` ref is such a strong clue\n\nThe missing ref is not just a side symptom. It is one of the most important parts of the diagnosis.\n\nHugging Face’s Parquet docs say the viewer’s Parquet files are published on `refs/convert/parquet`. Meanwhile, the Hub API docs explain that `converts` are internal refs used to push preprocessed data in dataset repos. So if that ref existed before the rename and is absent afterward, the natural reading is:\n\n  * the old generated viewer state is gone or no longer attached, and\n  * the new generated viewer state failed to build. (huggingface.co, huggingface.co)\n\n\n\nThere is also public evidence that these generated refs can become stale or out of sync after repo changes. In one Hugging Face discussion, a maintainer explains that the auto-generated `refs/convert/*` branches are updated only when the viewer updates, and the user shows `refs/convert/parquet` not matching newer content on `main`. In a separate dataset-viewer issue, Hugging Face describes a corner case where old Parquet files remain on the Hub after the dataset is updated, so the viewer layer and `main` can diverge. (huggingface.co, github.com)\n\nSo the rename-specific angle in your case is not “renaming destroys data.” It is more like “renaming forced the system back through a fragile generated-state path.”\n\n## 5. 
What is most likely happening in your specific case\n\nMy best technical reading is this:\n\n  * your underlying dataset files are probably still fine,\n  * the viewer backend tried to regenerate the Parquet mirror for the renamed repo,\n  * during split preparation, it encountered a shard bookkeeping pattern that the older code handled incorrectly,\n  * the Parquet-generation job aborted before it could republish `refs/convert/parquet`,\n  * the viewer UI now has only the preview path available. (github.com, huggingface.co, huggingface.co)\n\n\n\nIf I had to rank the causes:\n\n### Most likely\n\nA **known Hugging Face backend bug** in Parquet generation around shard bookkeeping, exposed when the rename forced regeneration. (github.com, github.com)\n\n### Also plausible\n\nA **stale or desynchronized hidden Parquet ref** problem after repo change, where the viewer’s generated state no longer lines up cleanly with `main`. (huggingface.co, github.com)\n\n### Less likely\n\nA real corruption or format defect in your dataset content itself. The traceback is pointing much more strongly at the generation layer than at raw-data parsing. (github.com)\n\n### Least likely\n\nA generic upstream rate limit or external hosting failure. Hugging Face does have dataset-viewer failures where external 429/403 errors bubble up as a generic generation failure, but those cases have a different shape than your `original_shard_lengths` crash. (github.com)\n\n## 6. Why I do **not** think the rename directly “broke the data”\n\nA rename changes the repo identity. It does not normally rewrite the actual dataset contents. The evidence in your traceback points to the conversion job that rebuilds viewer artifacts, not to a change in the rows themselves. The official viewer docs and the Hub API docs reinforce that distinction: the full viewer depends on separate generated Parquet state, and that state is managed through hidden convert refs. 
(huggingface.co, huggingface.co)\n\nThat is why the right mental model is:\n\n> rename = trigger\n>  conversion bug / stale viewer state = root problem\n\nnot:\n\n> rename = data corruption\n\n## 7. What to do now\n\n### Step 1. Check the viewer state directly\n\nRun:\n\n\n    curl \"https://datasets-server.huggingface.co/is-valid?dataset=<namespace>/<repo>\"\n    curl \"https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>\"\n\n\nand:\n\n\n    from huggingface_hub import HfApi\n\n    api = HfApi()\n    print(api.list_repo_refs(\"<namespace>/<repo>\", repo_type=\"dataset\"))\n\n\nThese tell you three different things:\n\n  * whether Hugging Face considers the dataset previewable but not viewable,\n  * whether the Parquet layer exists,\n  * whether the convert refs are present at all. (huggingface.co, huggingface.co, huggingface.co)\n\n\n\n### Step 2. Make one tiny commit\n\nHugging Face’s dataset-viewer issue history indicates that dataset updates trigger backend jobs through a webhook path. So a small README or dataset card edit is a reasonable way to retrigger Parquet-and-info generation. It is not guaranteed to work if the worker still carries the buggy code path, but it is the simplest clean retry. (github.com)\n\n### Step 3. Open a dataset discussion with the exact details\n\nThis matters. The `dataset-viewer` repo itself says that when a dataset page shows a viewer error, the efficient route is to open a discussion on the dataset page and tag the viewer team. Your report should include:\n\n  * repo was renamed,\n  * `refs/convert/parquet` existed before and is now missing,\n  * exact traceback,\n  * especially the `original_shard_lengths[...] IndexError`. (github.com)\n\n\n\nThat gives Hugging Face maintainers the strongest possible signal that this is a backend conversion problem, not a generic UI complaint.\n\n## 8. 
The best self-service workaround\n\nIf you need a durable fix without waiting for Hugging Face to repair or rerun the conversion, the cleanest workaround is to **publish the dataset natively as Parquet** on `main`.\n\nWhy that works:\n\n  * Hugging Face documents that if the dataset is already in Parquet, the `refs/convert/parquet` branch can usually just link to the original Parquet files instead of performing a new conversion. (huggingface.co)\n  * The `datasets` docs say `Dataset.push_to_hub()` publishes the dataset as a Parquet dataset and exposes shard controls such as `max_shard_size` and `num_shards`. (huggingface.co)\n\n\n\nSo in your case, publishing Parquet directly is not just an optimization. It is a way to bypass the exact conversion layer that is currently failing.\n\n## 9. If you regenerate locally\n\nIf your workflow involves regenerating the dataset locally before upload, then use a `datasets` version new enough to include the fix path associated with the `original_shard_lengths` bug. The public evidence points to that class of fix being present by `datasets` 4.6.0. That does not control the version Hugging Face is running in its backend workers, but it does reduce the chance of reproducing the same problem on your side while you generate and upload Parquet yourself. (github.com, github.com)\n\n## 10. My bottom-line diagnosis\n\nHere is my actual opinion, stated plainly:\n\n**The rename probably exposed a Hugging Face viewer-generation bug or stale hidden-ref state.** The missing `refs/convert/parquet` ref is best understood as a symptom of failed or incomplete viewer regeneration. The exact traceback points strongly toward a known shard-indexing bug in the backend conversion path. Your dataset contents are probably not the real problem. (huggingface.co, huggingface.co, github.com, github.com, huggingface.co, github.com)\n\n## 11. Recommended action order\n\n  1. Check `/is-valid`, `/parquet`, and `list_repo_refs()`. 
(huggingface.co, huggingface.co, huggingface.co)\n  2. Make one tiny commit to retrigger the viewer jobs. (github.com)\n  3. Open a dataset discussion with the exact traceback and the rename correlation. (github.com)\n  4. If you need a self-service fix, republish as native Parquet. (huggingface.co, huggingface.co)\n\n",
  "title": "Dataset viewer broke after repo rename"
}