Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiejexqnzzumvl3gzmwik5krt3gdsoqmdfoc2ogdqarpgfu4qjtvla",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mijeies6ca62"
  },
  "path": "/t/bug-in-google-colab-assemble-everything-pytorch/174892#post_2",
  "publishedAt": "2026-04-02T13:21:59.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub",
    "PyPI"
  ],
  "textContent": "The cause is likely version drift resulting from the major update from Transformers v4 to v5. While this is a common occurrence, this particular model seems to have slightly more compatibility issues than typical models:\n\n* * *\n\nThe main cause is **version drift**.\n\nYour notebook installs `transformers[sentencepiece]` with no version pin, then immediately loads `tblard/tf-allocine` through `AutoTokenizer.from_pretrained(...)`. That model repo is old, CamemBERT-based, and its files are legacy-shaped: `sentencepiece.bpe.model`, `special_tokens_map.json`, a **2-byte** `tokenizer_config.json`, and **TensorFlow weights only** as `tf_model.h5`. Hugging Face released `transformers` v5 on **January 26, 2026** , and PyPI now serves **5.4.0** as latest, so a notebook that once worked can now pull a substantially different tokenizer stack than it originally expected. (Hugging Face)\n\n## What is happening in your notebook\n\nThe first fragile step is not the PyTorch model load. It is the tokenizer load.\n\nIn your notebook, the first real model-related operation is:\n\n\n    from transformers import AutoTokenizer\n\n    checkpoint = \"tblard/tf-allocine\"\n    tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\n\nSo if the run fails there, the problem is already present **before** PyTorch inference logic matters. That fits the current upstream bug pattern almost exactly. In March 2026, Hugging Face had multiple reports where `AutoTokenizer.from_pretrained(...)` failed for older CamemBERT-family models inside `transformers/models/camembert/tokenization_camembert.py` with `ValueError: too many values to unpack (expected 2)`, and the same reports say the models worked on `transformers` 4.57.x but failed on 5.x. 
(GitHub)\n\n## Why it breaks now\n\n`transformers` v5 changed tokenizer internals in a major way.\n\nThe v5 release notes and migration guide say Hugging Face is moving away from the old slow/fast tokenizer split, consolidating to a single tokenizer file per model, preferring the `tokenizers` backend, and supporting SentencePiece through a lighter compatibility layer. The release notes also describe v5 as the first major release in five years and explicitly call out **tokenization** as one of the significant API changes. That is exactly the area touched by your checkpoint. (GitHub)\n\nCamemBERT is also the right family to suspect here. The official CamemBERT docs say its tokenizer uses a **SentencePiece vocab file** and that the fast tokenizer is **Unigram-based**. The model card labels the checkpoint as **CamemBERT**, and the files page shows a SentencePiece model file. So the upstream tokenizer refactor and the model’s storage format intersect directly in your case. (Hugging Face)\n\n## Why this checkpoint is extra fragile\n\nThis checkpoint is not a modern PyTorch-first repo.\n\nThe model card’s usage example is:\n\n\n    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification\n    tokenizer = AutoTokenizer.from_pretrained(\"tblard/tf-allocine\")\n    model = TFAutoModelForSequenceClassification.from_pretrained(\"tblard/tf-allocine\")\n\n\nAnd the files page shows `tf_model.h5`, not a native PyTorch weight file. That means the notebook is doing two things at once:\n\n  1. loading an older tokenizer format, and\n  2. asking PyTorch to load from TensorFlow weights later with `from_tf=True`.\n\n\n\nThe second part is legal. Older AutoModel docs explicitly show `from_tf=True` for loading a TensorFlow checkpoint into a PyTorch auto-model. But it is still an extra compatibility layer after the tokenizer problem. (Hugging Face)\n\n## So what is the actual root cause\n\nFor your case, I would rank the causes like this:\n\n### 1. 
Primary cause\n\nA **current `transformers` v5 tokenizer regression or incompatibility** with some older CamemBERT-family SentencePiece checkpoints. The closest public reports show the same code path and same exception, and both point to v5 breaking cases that worked on v4.57.x. (GitHub)\n\n### 2. Enabler\n\nThe notebook’s install line is **unpinned**, so Colab fetches the latest library stack instead of the stack the lesson originally expected. (PyPI)\n\n### 3. Secondary complication\n\nThe checkpoint is **TensorFlow-native on the Hub**, so the PyTorch version of the notebook depends on `from_tf=True` and conversion logic after tokenizer loading succeeds. (Hugging Face)\n\n## What it is _not_\n\nIt is probably **not** primarily a PyTorch bug.\n\nWhy: the strongest matching public failures break at `AutoTokenizer.from_pretrained(...)`, not deep inside a model forward pass. Also, your notebook installs `transformers[sentencepiece]`, so this is less likely to be the simple “you forgot SentencePiece” class of failure. A stale Colab kernel can still make optional dependencies invisible, but the bigger pattern here is the v5 tokenizer change plus an old checkpoint. (GitHub)\n\n## Best fix for this notebook\n\nUse a **known-good v4 stack** and restart the runtime.\n\nReplace the install cell with:\n\n\n    !pip -q uninstall -y transformers tokenizers sentencepiece\n    !pip -q install \"transformers==4.57.1\" \"tokenizers==0.22.1\" \"sentencepiece>=0.1.99,<0.3\"\n\n\nThen **restart the Colab runtime** and run the notebook again from the top.\n\nWhy this is the safest fix:\n\n  * a very similar public issue reports **broken** on `transformers 5.2.0 / tokenizers 0.22.2` and **working** on `transformers 4.57.1 / tokenizers 0.22.1`;\n  * another similar issue reports **broken** on `5.3.0` and **working** on `4.57.6`. 
(GitHub)\n\n\n\n## Keep this line in the PyTorch cell\n\nThis part of the notebook is still correct:\n\n\n    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, from_tf=True)\n\n\nThe reason is simple: the Hub repo exposes TensorFlow weights, not native PyTorch weights. If you remove `from_tf=True`, you would be fighting the wrong problem. (Hugging Face)\n\n## A cleaner alternative\n\nIf your goal is only to finish the lesson with minimal friction, there are two cleaner paths:\n\n### Path A\n\nStay with this checkpoint, but run it in the **TensorFlow style shown on the model card** with `TFAutoModelForSequenceClassification`. That matches how the checkpoint was published. (Hugging Face)\n\n### Path B\n\nKeep the lesson in PyTorch, but switch to a **newer PyTorch-native sentiment checkpoint** instead of using an old TensorFlow-first repo. That is the more future-proof route, because it removes both the tokenizer-compatibility risk and the TF-to-PT conversion layer. This second option would need model selection work, so it is more of a migration choice than a quick fix.\n\n## Background lesson\n\nThis is a standard ML-notebook failure mode:\n\n  * the notebook code still looks correct,\n  * the model repo still exists,\n  * the tutorial step is still logically fine,\n  * but a floating dependency pulls in a major library release,\n  * and an older checkpoint hits a new compatibility edge.\n\n\n\nThat is why this feels sudden. The notebook did not become conceptually wrong. Its **environment assumptions expired**. The v5 release notes explicitly frame v5 as a major release with significant API and tokenizer changes, and the current PyPI release cadence makes unpinned notebooks drift quickly. 
(GitHub)\n\n## Bottom line\n\nFor your notebook, the clearest explanation is:\n\n  * **old checkpoint**\n  * **new tokenizer stack**\n  * **unpinned Colab install**\n  * **TensorFlow-first model used from a PyTorch notebook**\n\n\n\nThe safest solution is to **pin back to a v4.57.x environment**, restart the runtime, and keep `from_tf=True` for the model load. That addresses the real failure point and matches the closest current upstream evidence. (GitHub)\n\n* * *\n\n## Cells that truly need fixing\n\n### **Cell 2**\n\nCurrent:\n\n\n    !pip install transformers[sentencepiece]\n\n\nThis is the main problem. It installs a floating latest version, and `transformers` v5 introduced major tokenization changes with weekly minor releases after the v5 launch. That makes old notebooks drift into new behavior. Current upstream reports show very similar CamemBERT-family tokenizer failures on v5 that work on v4.57.x. (GitHub)\n\nReplace it with:\n\n\n    # Run once, then restart the Colab runtime.\n    !pip -q uninstall -y transformers tokenizers sentencepiece\n    !pip -q install \"transformers==4.57.1\" \"tokenizers==0.22.1\" \"sentencepiece>=0.1.99,<0.3\"\n\n\n### **Cell 11**\n\nCurrent:\n\n\n    import torch\n    from transformers import AutoTokenizer, AutoModelForSequenceClassification\n\n    checkpoint = \"tblard/tf-allocine\"\n    tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, from_tf=True)\n    sequences = [\n        \"J'ai attendu un cours de HuggingFace toute ma vie.\",\n        \"Moi aussi !\",\n    ]\n\n    tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors=\"pt\")\n    output = model(**tokens)\n\n\nThis cell is **mostly correct already**. The important part is `from_tf=True`, because that checkpoint exposes TensorFlow weights (`tf_model.h5`) rather than native PyTorch weights. So this cell does **not** need a conceptual fix, only a light cleanup. 
(GitHub)\n\nCleaner version:\n\n\n    import torch\n    from transformers import AutoModelForSequenceClassification\n\n    checkpoint = \"tblard/tf-allocine\"\n\n    model = AutoModelForSequenceClassification.from_pretrained(\n        checkpoint,\n        from_tf=True,\n    )\n\n    sequences = [\n        \"J'ai attendu un cours de HuggingFace toute ma vie.\",\n        \"Moi aussi !\",\n    ]\n\n    # `tokenizer` was created in Cell 3; run that cell first.\n    tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors=\"pt\")\n\n    with torch.no_grad():\n        outputs = model(**tokens)\n\n    print(outputs.logits)\n\n\n## Cells that do **not** need fixing\n\n### **Cell 3**\n\n\n    from transformers import AutoTokenizer\n\n    checkpoint = \"tblard/tf-allocine\"\n    tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\n    sequence = \"J'ai attendu un cours d’HuggingFace toute ma vie.\"\n\n    model_inputs = tokenizer(sequence)\n\n\nThis is fine **once Cell 2 is fixed and the runtime is restarted**. The failure surfaces here, but the root cause is upstream of it: the environment. So this cell does not need a logic fix. It only needs the install cell above it to stop pulling the bad version range.\n\n### **Cells 4 to 10**\n\nThese are demo cells for:\n\n  * single example tokenization\n  * batch tokenization\n  * padding\n  * truncation\n  * tensor return types\n  * token IDs and decoding\n\n\n\nThey are **not the cause of the break**. They may be cleaned up, but they do not need fixing for compatibility.\n\n## Small nuance\n\nIf you want the notebook to be more robust and easier to read, then I would still rewrite Cells 3 to 10 for consistency. 
But that is **optional cleanup**, not a necessary fix.\n\nSo the minimum necessary patch is:\n\n  * **Fix Cell 2**\n  * **Restart runtime**\n  * Keep **Cell 3**\n  * Keep **Cell 11**, maybe with a light cleanup\n\n\n\n## Minimum-diff patch\n\nIf you want the least editing possible, use only these two replacements.\n\n### Replace Cell 2 with\n\n\n    # Run once, then restart the Colab runtime.\n    !pip -q uninstall -y transformers tokenizers sentencepiece\n    !pip -q install \"transformers==4.57.1\" \"tokenizers==0.22.1\" \"sentencepiece>=0.1.99,<0.3\"\n\n\n### Replace Cell 11 with\n\n\n    import torch\n    from transformers import AutoModelForSequenceClassification\n\n    checkpoint = \"tblard/tf-allocine\"\n\n    model = AutoModelForSequenceClassification.from_pretrained(\n        checkpoint,\n        from_tf=True,\n    )\n\n    sequences = [\n        \"J'ai attendu un cours de HuggingFace toute ma vie.\",\n        \"Moi aussi !\",\n    ]\n\n    # `tokenizer` was created in Cell 3; run that cell first.\n    tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors=\"pt\")\n\n    with torch.no_grad():\n        outputs = model(**tokens)\n\n    print(outputs.logits)\n\n\n## Bottom line\n\n**Necessary fixes:** Cell 2.\n**Recommended cleanup:** Cell 11.\n**Optional cleanup only:** Cells 3 to 10.",
  "title": "Bug in Google Colab Assemble Everything (PyTorch)"
}