{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiefwbj2mqv7vzq3nfkictfgc6dvj4otaur3hes4ln4dpe2aksjpyq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgtyjj6ruwy2"
},
"path": "/t/valueerror-loading-helsinki-nlp-tokenizers/174192#post_3",
"publishedAt": "2026-03-12T02:32:19.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"PyPI",
"GitHub"
],
"textContent": "> But the mystery remains . . . WHY didn’t it work as advertised?!?!?\n\nIn Transformers, models and tokenizers _may_ implicitly require backends outside of Transformers and PyTorch in some cases.\n\nIn this case, uninstalling `sentencepiece` allowed me to reproduce the issue in my environment. Explicitly installing `sentencepiece` would be a quick workaround.\n\n* * *\n\nNo, you probably do **not** need to downgrade `transformers`. Current Transformers supports Python 3.10+, so Python 3.12 itself is not the problem, and Marian is still a supported `AutoTokenizer` family in the docs. (PyPI)\n\nWhat is happening is that `Helsinki-NLP/opus-mt-*` uses **MarianTokenizer**, and MarianTokenizer is **SentencePiece-based**. In current Transformers, the auto-mapping for Marian is effectively: use `MarianTokenizer` **only if** `sentencepiece` is available. The Marian tokenizer source also explicitly requires the `sentencepiece` backend. (GitHub)\n\nSo when `sentencepiece` is missing or not visible in the current runtime, `AutoTokenizer.from_pretrained(...)` can fail with that misleading error:\n\n> `Unrecognized configuration class ... MarianConfig ...`\n\neven though `MarianConfig` is listed. That is a known class of bug/confusing behavior; there is a recent Transformers issue specifically about cryptic `AutoTokenizer` errors for SentencePiece tokenizers when `sentencepiece` is not installed. (GitHub)\n\nThat also explains why `MarianTokenizer.from_pretrained(...)` helped: it bypasses the auto-dispatch layer. If `sentencepiece` is available, direct `MarianTokenizer` works. If `sentencepiece` is truly missing, direct `MarianTokenizer` should fail too, but with a much clearer “install SentencePiece” error. (GitHub)\n\nUse this fix:\n\n\n pip install -U sentencepiece\n\n\nThen **restart the Python process / Jupyter kernel** and retry:\n\n\n from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n\n tokenizer = AutoTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n model = AutoModelForSeq2SeqLM.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n\n\nOr, as a workaround, use:\n\n\n from transformers import MarianTokenizer\n tokenizer = MarianTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-ru-en\")\n\n\nYou can also be explicit in v5:\n\n\n tokenizer = AutoTokenizer.from_pretrained(\n \"Helsinki-NLP/opus-mt-ru-en\",\n backend=\"sentencepiece\",\n )\n\n\nThat only helps if `sentencepiece` is actually installed. (GitHub)\n\nSo the short explanation is:\n\n * **Not** “Python 3.12 broke Marian.”\n * **Not necessarily** “downgrade Transformers.”\n * **Most likely**: `sentencepiece` was missing or not visible in that runtime, and `AutoTokenizer` surfaced it as a confusing MarianConfig error. (GitHub)\n\n",
"title": "ValueError loading Helsinki-NLP tokenizers"
}