Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicop6mvkus7kojlw3mv3s327wsaertvbgv42ib6trlvviywysniwq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk4vimykuck2"
  },
  "path": "/t/problem-of-dataset-formatting-and-croissant-metadata/175458#post_2",
  "publishedAt": "2026-04-23T00:25:09.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face"
  ],
  "textContent": "This applies to the dataset itself, but especially for the dataset viewer, the `README.md` file serves as a configuration file. If a model or dataset isn’t recognized automatically, editing the beginning of the `README.md` file (`YAML` section) may resolve the issue:\n\nHowever, there’s always a possibility of a bug, so don’t worry if it doesn’t work…\n\n* * *\n\n## What is going wrong\n\nHugging Face is currently treating your repo like **one CSV dataset** , but your repo actually contains **two different kinds of tables** :\n\n  * `instances.csv` = node table\n  * `interactions.csv` = edge table\n\n\n\nThose two files do **not** have the same columns. Your dataset page already shows that exact problem: Hugging Face inferred only one subset (`default`) and one split (`train`), then failed with `DatasetGenerationCastError` because one group of files has columns `Source, Target, Weight` while another has `host, version, registration_enabled, Id, Label`. (Hugging Face)\n\n## So is it just a matter of time?\n\nProbably not.\n\nThis does not look like “the upload is still processing.” It looks like a **real schema error**. The page is already showing a specific failure, not just a temporary loading state. (Hugging Face)\n\n## Why this happens\n\nThe Hugging Face dataset viewer is built around a **tabular** idea: one data point is one row, and features are columns. If it auto-detects many CSV files as belonging to one dataset split, it expects them to share one schema. Your graph data breaks that assumption because each graph is stored as **two different tables** with different columns. (Hugging Face)\n\n## Why the Croissant file is almost empty\n\nOn Hugging Face, generated Croissant metadata is built from the dataset-viewer / Parquet pipeline. The official Croissant example shows `recordSet` entries tied to Hugging Face–converted Parquet files for each config. 
So if the viewer cannot cleanly build the dataset first, the generated Croissant will often be thin or missing useful `recordSet` entries. (Hugging Face)\n\n## The real cause, in one sentence\n\nYour dataset is not failing because it is a graph dataset; it is failing because Hugging Face is currently reading it as **one mixed CSV dataset with incompatible schemas**. (Hugging Face)\n\n## The easiest fix\n\nTell Hugging Face explicitly that these are **two separate dataset parts**.\n\nUse manual config in `README.md`, with one config for `instances.csv` and one for `interactions.csv`. The docs show that dataset configs use `config_name`, `data_files`, `split`, and `path`. (Hugging Face)\n\nA simple starting point is:\n\n\n    ---\n    configs:\n      - config_name: instances\n        data_files:\n          - split: train\n            path: \"*/*/*/instances.csv\"\n\n      - config_name: interactions\n        data_files:\n          - split: train\n            path: \"*/*/*/interactions.csv\"\n    ---\n\n\nThat tells Hugging Face: “do not mix these files together.” This is exactly the kind of problem manual configuration is for. (Hugging Face)\n\n## An even better fix\n\nAdd a few columns to both table types so every row says which graph it belongs to, for example:\n\n  * `graph_id`\n  * `software`\n  * `graph_type`\n  * `snapshot_date`\n\nThat way:\n\n  * all node rows can live in one clean schema\n  * all edge rows can live in one clean schema\n  * users can filter by graph\n  * Croissant has a much better chance of becoming meaningful\n\nThe docs do not explicitly require this, but it matches the viewer’s row/column design much better. 
(Hugging Face)\n\n## Best long-term design\n\nThe most Hugging Face-friendly structure is often:\n\n**one row = one graph snapshot**\n\nFor example, one processed dataset where each row contains:\n\n  * graph metadata\n  * counts\n  * maybe paths to node/edge files\n  * or another structured representation\n\nThat works better with the viewer because the viewer is fundamentally row-based. (Hugging Face)\n\n## About the missing “Use this dataset” button\n\nI would treat that as a **symptom**, not the main problem.\n\nFirst fix the dataset structure so the viewer can understand it. Then check the page again. Right now the clearer signal is the cast error on the page itself. (Hugging Face)\n\n## What to do next\n\n  1. Add manual configs to separate `instances.csv` and `interactions.csv`. (Hugging Face)\n\n  2. Re-push the repo.\n\n  3. Check Hugging Face’s dataset server endpoints:\n\n     * `/is-valid`\n     * `/splits`\n     * `/first-rows`\n\n     The docs recommend these endpoints for checking validity, available configs/splits, and preview rows. (Hugging Face)\n\n  4. Only after that, check `/croissant` again. (Hugging Face)\n\n## Bottom line\n\nYou are close.\n\nThe issue is not that your dataset is “too unusual” for Hugging Face. The issue is that Hugging Face needs **clearer instructions** for how to separate your two table types. Once you stop the node CSVs and edge CSVs from being merged into one inferred split, the viewer should improve, and the Croissant output should likely improve too. (Hugging Face)",
  "title": "Problem of Dataset formatting and Croissant metadata"
}