Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieil2y5x5dxq3c3rmamgogxgrzarmmt6cuoxjz6mf4pli32mfzmny",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mj4ecsbrfwa2"
  },
  "path": "/t/dual-tags-for-turkish-language-in-search-filters/175129#post_2",
  "publishedAt": "2026-04-10T00:08:09.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face",
    "Hugging Face Forums"
  ],
  "textContent": "This is because the implementation allows for multiple language-codes.\n\nAs far as the display on the Hub GUI, it’s always been like that. Or so I thought, but judging by GitHub, it seems to be a display bug that’s remained unresolved since 2022… Since they apparently intended to fix it, it’s probably not a design choice. It doesn’t cause any real harm, but I suppose it’s fair to treat it as a bug…\n\n* * *\n\nMy best read is this:\n\n**The current behavior is split into two layers.**\nAt the **metadata layer** , it looks intended that both `tr` and `tur` can exist. At the **search/filter UI layer** , it looks unintended, or at least like an unfinished normalization problem, that those aliases show up as separate Turkish filters instead of one canonical Turkish bucket. (GitHub)\n\n## 1. The background\n\nHugging Face repo cards explicitly allow language metadata to be written using **ISO 639-1, ISO 639-2, or ISO 639-3** codes. Their code and docs both say that for models and datasets, `language` can be a two-letter or three-letter code. That means values like `tr`, `tur`, `en`, `eng`, `fr`, and `fra` are all valid metadata inputs. (GitHub)\n\nThat matters because it explains **why duplicates can be created in the raw data**. If one repo author writes `tr` and another writes `tur`, both are accepted by the platform. So duplicate language buckets are not surprising at the storage level. (GitHub)\n\n## 2. What the current UI is actually doing\n\nThe current UI is not just accepting both codes in metadata. It is also **surfacing them separately**.\n\nOn Hugging Face’s `/languages` page, the same real-world language appears more than once with different codes and different counts. 
For example:\n\n  * **English** appears as `en` with **57,167 datasets / 312,494 models** and also as `eng` with **1,896 / 1,184**.\n  * **French** appears as `fra` with **10,291 / 1,999** and also as `fr` with **3,002 / 16,956**.\n  * **Turkish** appears as `tr` with **1,487 / 5,867** and also as `tur` with **165 / 80**.\n  * **German** appears as `de` with **2,349 / 14,974** and also as `deu` with **691 / 937**. (Hugging Face)\n\nThe model search pages show the same split. Right now, `language=tr` returns **5,872** models, while `language=tur` returns **81**. Likewise, `language=en` returns **312,770** models, while `language=eng` returns **1,186**. That means the filter backend is treating these aliases as **different search keys**, not as one normalized language. (Hugging Face)\n\nThere is also a UI-level clue that normalization is weak. On the models filter page, the quick language list shows **“English” twice** and **“French” twice** in the same visible filter row. That is exactly what you would expect if multiple codes were being mapped to the same display label without being deduplicated first. (Hugging Face)\n\n## 3. Why I think the metadata part is intended\n\nThis part is the easiest to call.\n\nHugging Face’s own repo-card code and docs do not restrict users to one canonical code family. They explicitly allow 639-1, 639-2, and 639-3. So a repository tagged with `tr` is valid, and a repository tagged with `tur` is also valid. That is not a bug by itself. (GitHub)\n\nThere is also a long-standing community push for **broader language-code support**, not narrower support. In the forum discussion about BCP-47 or at least ISO 639-3 support, users argue that two-letter codes are incomplete and that the Hub should support broader language identifiers. That aligns with Hugging Face allowing multiple code standards in metadata. 
(Hugging Face Forums)\n\nSo if the question is, “Should Hugging Face permit repos to carry `tr` or `tur`?” then the answer is **yes, that appears intentional**. (GitHub)\n\n## 4. Why I think the UI behavior is probably not intended\n\nThis part is an inference, but a strong one.\n\nThe strongest evidence is **issue `hub-docs#193`**. In that issue, the discussion says a useful improvement would be to **transform ISO 639-2 or 639-3 tags into ISO 639-1**, and it gives `fra` versus `fr` as the concrete example. The stated reason is discoverability: datasets tagged `fra` should be findable as French. That is the opposite of the current UI behavior, where `fr` and `fra` are still separate filter buckets. (GitHub)\n\nHugging Face’s own **Huggy Lingo** blog post points the same way. It says that when their metadata-enrichment pipeline predicts a language in **ISO 639-3**, they convert it to **ISO 639-1 where possible**, and it explicitly says this is because ISO 639-1 codes have **better support in the Hub UI for navigating datasets**. That tells me the product thinking was not “keep all equivalent aliases separate in the UI.” It was closer to “accept broad inputs, but steer toward a canonical UI representation.” (GitHub)\n\nRelated GitHub issues also frame the broader language-filter situation as a **problem**, not as a settled design choice. One issue says datasets tagged by ISO language code were not accessible through the language search form. Another says some ISO 639-3 codes were present in the list but impossible to enter in the Hub. Those are exactly the kinds of bugs you see when storage, input widgets, and search normalization are not fully aligned. (GitHub)\n\nThe dataset issue about the language-code database goes even wider. It calls for connecting to a bigger language-code database and notes that the current list is partial and hard to maintain. 
That again sounds like an **incomplete language-metadata system**, not a deliberate choice to present alias duplicates as separate first-class languages forever. (GitHub)\n\n## 5. What I think is happening technically\n\nThe simplest model is:\n\n  1. Hugging Face **accepts multiple code standards** in repo metadata.\n  2. The Hub **stores and indexes those values largely as provided**.\n  3. The UI converts codes into human-readable names like “Turkish” or “English”.\n  4. But the UI and search system **do not fully canonicalize aliases** before counting, filtering, or displaying them. (GitHub)\n\nThat would explain all of the observed behavior at once:\n\n  * why `tr` and `tur` are both allowed,\n  * why they get separate counts,\n  * why both show up as “Turkish,”\n  * why the same duplication shows up for English, French, German, and others,\n  * and why old GitHub issues talk about normalization and search discoverability. (GitHub)\n\n## 6. My actual conclusion\n\nHere is the plain version:\n\n  * **Intended:** Hugging Face allowing both `tr` and `tur` in metadata. (GitHub)\n  * **Probably not intended as the final UX:** the search UI treating those aliases as separate Turkish filters and separate count buckets. (GitHub)\n\nSo I would describe it as:\n\n> **Not a metadata bug. Likely a search/filter normalization bug or product gap.**\n\n## 7. Confidence level\n\nI have **high confidence** in the first part: accepting both code families is by design. (GitHub)\n\nI have **medium-high confidence** in the second part: the current duplicate-filter UI is probably not intended behavior, because Hugging Face’s own issues and blog material point toward **canonicalization for discoverability**, not toward keeping alias codes as separate user-facing language buckets. I cannot prove that with a maintainer quote saying “this is a bug,” but the direction of the evidence is pretty clear. (GitHub)",
  "title": "Dual Tags for Turkish Language In Search Filters"
}