ValueError loading Helsinki-NLP tokenizers
But the mystery remains . . . WHY didn’t it work as advertised?!?!?
In Transformers, models and tokenizers may in some cases implicitly require backends beyond Transformers and PyTorch.
In this case, uninstalling sentencepiece allowed me to reproduce the issue in my environment. Explicitly installing sentencepiece would be a quick workaround.
No, you probably do not need to downgrade transformers. Current Transformers supports Python 3.10+, so Python 3.12 itself is not the problem, and Marian is still a supported AutoTokenizer family in the docs. (PyPI)
What is happening is that Helsinki-NLP/opus-mt-* uses MarianTokenizer, and MarianTokenizer is SentencePiece-based. In current Transformers, the auto-mapping for Marian is effectively: use MarianTokenizer only if sentencepiece is available. The Marian tokenizer source also explicitly requires the sentencepiece backend. (GitHub)
So when sentencepiece is missing or not visible in the current runtime, AutoTokenizer.from_pretrained(...) can fail with that misleading error:
Unrecognized configuration class ... MarianConfig ...
even though MarianConfig is listed. That is a known class of bug/confusing behavior; there is a recent Transformers issue specifically about cryptic AutoTokenizer errors for SentencePiece tokenizers when sentencepiece is not installed. (GitHub)
That also explains why MarianTokenizer.from_pretrained(...) helped: it bypasses the auto-dispatch layer. If sentencepiece is available, direct MarianTokenizer works. If sentencepiece is truly missing, direct MarianTokenizer should fail too, but with a much clearer “install SentencePiece” error. (GitHub)
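To check whether sentencepiece is "missing or not visible in the current runtime", a minimal sketch like the following may help. It uses only the standard library; the helper name backend_visible is hypothetical, not part of Transformers. Run it in the same interpreter or notebook kernel that raised the error, since pip may have installed the package into a different environment.

```python
import importlib.util
import sys


def backend_visible(module_name: str = "sentencepiece") -> bool:
    """Return True if this interpreter can locate the given top-level module.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The interpreter path shows which environment is actually running.
    print("interpreter:", sys.executable)
    print("sentencepiece visible:", backend_visible())
```

If this reports that sentencepiece is not visible even though pip says it is installed, the interpreter path printed above is probably not the environment pip installed into.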
Use this fix:
pip install -U sentencepiece
Then restart the Python process / Jupyter kernel and retry:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
Or, as a workaround, use:
from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
You can also be explicit in v5:
tokenizer = AutoTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-ru-en",
    backend="sentencepiece",
)
That only helps if sentencepiece is actually installed. (GitHub)
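Since every variant above depends on sentencepiece being importable, one option is to fail fast with a clear message before AutoTokenizer gets the chance to raise the confusing MarianConfig error. This is only a sketch: load_opus_mt_tokenizer is a hypothetical wrapper, and the only Transformers call it makes is AutoTokenizer.from_pretrained as shown earlier.

```python
import importlib.util


def load_opus_mt_tokenizer(model_id: str = "Helsinki-NLP/opus-mt-ru-en"):
    """Hypothetical wrapper: verify sentencepiece first, then load the tokenizer."""
    if importlib.util.find_spec("sentencepiece") is None:
        raise RuntimeError(
            "sentencepiece is not importable in this interpreter; "
            "run `pip install -U sentencepiece` and restart the process."
        )
    # Imported here so the check above runs before Transformers is touched.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)
```

Calling load_opus_mt_tokenizer() then either returns the tokenizer or names the missing dependency directly.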
So the short explanation is:
- Not “Python 3.12 broke Marian.”
- Not necessarily “downgrade Transformers.”
- Most likely: sentencepiece was missing or not visible in that runtime, and AutoTokenizer surfaced it as a confusing MarianConfig error. (GitHub)