Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiefwbj2mqv7vzq3nfkictfgc6dvj4otaur3hes4ln4dpe2aksjpyq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgtrsbrfmtb2"
  },
  "path": "/t/valueerror-loading-helsinki-nlp-tokenizers/174192#post_3",
  "publishedAt": "2026-03-12T02:32:19.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "PyPI",
    "GitHub"
  ],
  "textContent": "> But the mystery remains . . . WHY didn’t it work as advertised?!?!?\n\nIn Transformers, models and tokenizers _may_ implicitly require backends outside of Transformers and PyTorch in some case.\n\nIn this case, uninstalling `sentencepiece` allowed me to reproduce the issue in my environment. Explicitly installing `sentencepiece` would be a quick workaround.\n\n* * *\n\nNo, you probably do **not** need to downgrade `transformers`. Current Transformers supports Python 3.10+, so Python 3.12 itself is not the problem, and Marian is still a supported `AutoTokenizer` family in the docs. (PyPI)\n\nWhat is happening is that `Helsinki-NLP/opus-mt-*` uses **MarianTokenizer** , and MarianTokenizer is **SentencePiece-based**. In current Transformers, the auto-mapping for Marian is effectively: use `MarianTokenizer` **only if** `sentencepiece` is available. The Marian tokenizer source also explicitly requires the `sentencepiece` backend. (GitHub)\n\nSo when `sentencepiece` is missing or not visible in the current runtime, `AutoTokenizer.from_pretrained(...)` can fail with that misleading error:\n\n> `Unrecognized configuration class ... MarianConfig ...`\n\neven though `MarianConfig` is listed. That is a known class of bug/confusing behavior; there is a recent Transformers issue specifically about cryptic `AutoTokenizer` errors for SentencePiece tokenizers when `sentencepiece` is not installed. (GitHub)\n\nThat also explains why `MarianTokenizer.from_pretrained(...)` helped: it bypasses the auto-dispatch layer. If `sentencepiece` is available, direct `MarianTokenizer` works. If `sentencepiece` is truly missing, direct `MarianTokenizer` should fail too, but with a much clearer “install SentencePiece” error. (GitHub)\n\nUse this fix:\n\n\n    pip install -U sentencepiece\n\n\nThen **restart the Python process / Jupyter kernel** and retry:\n\n\n    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n\n    tokenizer = AutoTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n    model = AutoModelForSeq2SeqLM.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n\n\nOr, as a workaround, use:\n\n\n    from transformers import MarianTokenizer\n    tokenizer = MarianTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n\n\nYou can also be explicit in v5:\n\n\n    tokenizer = AutoTokenizer.from_pretrained(\n        \"Helsinki-NLP/opus-mt-ru-en\",\n        backend=\"sentencepiece\",\n    )\n\n\nThat only helps if `sentencepiece` is actually installed. (GitHub)\n\nSo the short explanation is:\n\n  * **Not** “Python 3.12 broke Marian.”\n  * **Not necessarily** “downgrade Transformers.”\n  * **Most likely** : `sentencepiece` was missing or not visible in that runtime, and `AutoTokenizer` surfaced it as a confusing MarianConfig error. (GitHub)\n\n",
  "title": "ValueError loading Helsinki-NLP tokenizers"
}