{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigeo5mbe3agpy6ikkodlfxhl3swjmgtzqsbx4aev56mb5wia6h2ju",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mfwf37zi3af2"
},
"path": "/t/seeking-advice-qwen3-5-27b-failing-on-inference-endpoints-is-unsloth-gguf-a-viable-alternative-for-text-editing/173879#post_2",
"publishedAt": "2026-02-28T12:37:25.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Hugging Face",
"GitHub",
"vLLM",
"Unsloth",
"Red Hat Developer",
"qwen.readthedocs.io"
],
"textContent": "If you can use the nightly build of vLLM, you might be able to bypass the restrictions?\n\n* * *\n\n## 1) What your endpoint error actually means\n\nYour failure happens **before** any real “GPU sizing / memory” problems come into play.\n\nvLLM is trying to build a `ModelConfig` by asking **Transformers** to interpret the model’s `config.json`. That config declares a **new architecture identifier** :\n\n * `model_type: \"qwen3_5\"` (in the official checkpoint’s config)\n\n\n\nIf the Transformers version inside the Inference Endpoint container **doesn’t know what`qwen3_5` is**, `AutoConfig` can’t map it to a model class, and vLLM aborts with exactly the validation error you see.\n\nThat’s consistent with Hugging Face’s own Inference Endpoints runtime docs showing **Transformers 4.48.0** in at least some images—well before Qwen3.5 support existed upstream. (Hugging Face)\n\nSo yes: **your hypothesis is very plausible** —the container image you’re on is shipping a Transformers build that predates Qwen3.5 support.\n\n* * *\n\n## 2) Why this is happening “now”: Qwen3.5 support landed extremely recently\n\nTwo key timeline facts:\n\n * The **Transformers PR that adds Qwen3.5 support** (“Adding Support for Qwen3.5”) was **merged on Feb 9, 2026**. (GitHub)\n * The **Transformers release notes** show Qwen3.5 support called out in **v5.2.0** (Feb 2026) and mention installing with `pip install transformers --pre` for the v5 release candidates. (GitHub)\n\n\n\nIn other words, Qwen3.5 is in the awkward window where:\n\n * the **official weights are published** , but\n * many **serving images** are still pinned to older Transformers builds, and\n * even if you upgrade Transformers, your serving stack may have **version constraints** (see below).\n\n\n\n* * *\n\n## 3) Why “just upgrade Transformers” is not always trivial with vLLM (especially in managed containers)\n\nIn managed environments, you typically inherit whatever the image pins. 
vLLM has historically pinned Transformers in ways that can lag brand-new architectures; there are recent vLLM issues where models require Transformers versions that aren’t compatible with vLLM’s current constraints. (GitHub)\n\nAlso, vLLM itself is in flux for Qwen3.5: the vLLM team’s own **Qwen3.5 recipe** says to use **vLLM nightly** “until 0.17.0 is released,” which is a strong signal that stable releases may not yet cover all Qwen3.5 edge cases. (vLLM)\n\nSo on Inference Endpoints, unless you can:\n\n * switch to an image that already includes the needed Transformers commit, or\n * install Transformers from source inside the container, or\n * bring a **custom container**,\n\n\n\n…you can get stuck exactly where you are.\n\n* * *\n\n## 4) Is Unsloth’s GGUF a viable alternative for your use case (post-editing / rewriting)? Yes, with specific caveats\n\n### 4.1 What is _actually different_ between “official” and “Unsloth GGUF”\n\nFor your text-only post-editing workflow, the meaningful differences are usually:\n\n 1. **Weight format + quantization**\n\n\n * Official repo: typically **BF16/FP16** weights loaded via Transformers/vLLM.\n * GGUF repo: weights converted for **llama.cpp**, usually **quantized** (Q4/Q5/Q6/Q8, plus “UD-” variants).\n\n\n\nUnsloth’s Qwen3.5-27B-GGUF repo explicitly provides multiple quantizations (e.g., **Q4_K_M, Q5_K_M, Q6_K, Q8_0**, plus UD variants). (Hugging Face)\n\n 2. **Inference engine**\n\n\n * Official on vLLM: GPU-first serving, high throughput under concurrency.\n * GGUF on llama.cpp: optimized for portability and efficiency, often excellent on a single node, smaller GPUs, or with CPU offload.\n\n\n 3. **Multimodal handling (only matters if you use vision)**\nUnsloth includes an `mmproj` file (projection weights for multimodal in llama.cpp) alongside the GGUFs. 
(Hugging Face)\nIf you’re purely doing text post-editing, you can ignore the multimodal files.\n\n\n\n### 4.2 The caveat that matters most for post-editing: quantization can change “style obedience”\n\nYour task (“rewrite the given translation into a specified style, obey vocabulary/glossary rules”) is **sensitive to small model-quality regressions**. Quantization _can_:\n\n * slightly reduce instruction fidelity,\n * increase minor wording drift,\n * weaken consistency on strict terminology.\n\n\n\n**Practical implication:** if you stay on GGUF, prefer higher-quality quants:\n\n * **Q8_0** (highest fidelity, largest)\n * **Q6_K** (often a strong quality/size trade-off)\n * be cautious with Q4 variants if your style guide is strict.\n\n\n\n(You don’t need to guess: run an A/B test on your real post-edit set; see §7.)\n\n### 4.3 “Thinking mode” / verbosity differences can bite text-editing pipelines\n\nSome Qwen3.5 builds expose “thinking vs non-thinking” behavior. If your pipeline expects **only the final rewritten text**, you must ensure the runtime isn’t emitting internal reasoning or long “thinking” blocks.\n\nUnsloth’s llama.cpp instructions show using `--chat-template-kwargs \"{\\\"enable_thinking\\\": false}\"` for Qwen3.5. (Unsloth)\nThere are also community reports of “still thinking” behavior in some setups, so validate your exact llama.cpp build + template behavior early. 
(Hugging Face)\n\n* * *\n\n## 5) vLLM vs llama.cpp for post-editing: what differs in practice\n\nHere’s the comparison along the dimensions that matter for _translation post-editing_.\n\nDimension | vLLM (Transformers weights) | llama.cpp (GGUF)\n---|---|---\n**Output quality ceiling** | Highest (BF16/FP16, minimal approximation) | Depends on quant (Q8_0 is close; Q4 degradation can be noticeable) (Hugging Face)\n**Throughput under concurrency** | Typically excellent (batched serving) | Often less optimized for many concurrent users (still workable) (Red Hat Developer)\n**Operational friction today** | Blocked for you due to missing `qwen3_5` support in the image | Works now (you have already deployed it) (Hugging Face)\n**Long-context performance** | Often better once fully supported | Can be limited/slow depending on architecture support; Qwen3.5 uses hybrid/linear-attention mechanisms that may have performance cliffs (qwen.readthedocs.io)\n**Best use case** | Production, high volume, multi-tenant, GPU-rich | “Get it running” deployments, smaller GPU budgets, portable inference, quick iteration (Red Hat Developer)\n\nTwo extra notes:\n\n * Qwen’s own materials explicitly mention using frameworks like **vLLM** for deployment, note that **llama.cpp supports Qwen3.5** (text & vision), and suggest looking for GGUF models. (GitHub)\n * vLLM’s own Qwen3.5 recipe currently recommends **nightly** builds, indicating that support is still stabilizing. 
(vLLM)\n\n\n\n* * *\n\n## 6) So… should you continue with Unsloth GGUF or wait for official vLLM?\n\n### You’re likely OK to continue with Unsloth **if**:\n\n * you can use **Q6_K or Q8_0**, and\n * you verify that “thinking” is disabled and outputs are clean, and\n * your workload is not heavily multi-tenant/high-concurrency _right now_.\n\n\n\nFor many post-editing pipelines, that’s enough to be production-viable.\n\n### You should push for the official weights on vLLM (or a custom vLLM image) **if** you need:\n\n * **maximum style fidelity** (especially strict terminology/glossary adherence),\n * **high concurrency throughput**,\n * **very long inputs** (large context windows) with predictable speed,\n * or multimodal features in a standardized serving stack.\n\n\n\n**Concrete path (if you control the container):**\n\n * Use a vLLM build aligned with the Qwen3.5 guidance (nightly per vLLM’s recipe). (vLLM)\n * Use a Transformers build that includes Qwen3.5 support (post-merge PR / v5.2.0+). (GitHub)\nIn fully managed images, this typically means “switch image” or “custom container,” not just toggling GPU size.\n\n\n\n* * *\n\n## 7) How to evaluate “major differences” for your exact post-editing task (fast, reliable)\n\nRun a small bake-off that reflects your real constraints.\n\n### 7.1 Build a test set (30–200 items)\n\nFor each item keep:\n\n * **Source text**\n * **Machine translation**\n * **Style guide excerpt + glossary/forbidden terms**\n * **Gold post-edit** (if you have it) or at least a human rating rubric\n\n\n\n### 7.2 Compare these conditions\n\n * Official (when possible): BF16/FP16 on vLLM\n * GGUF: **Q8_0** and **Q6_K** (optionally Q5_K_M)\n\n\n\n### 7.3 Measure what matters for post-editing\n\n * **Terminology accuracy** (glossary terms always used, forbidden terms never used)\n * **Meaning preservation** (human check or targeted heuristics)\n * **Style compliance** (human rubric, or pattern checks if style is formalized)\n * **Editing stability** (does it 
rewrite only what’s needed vs over-edit?)\n\n\n\nIf Q6/Q8 GGUF matches your thresholds, you have a defensible “good enough now” solution.\n\n* * *\n\n## 8) Practical recommendations if you stick with GGUF (llama.cpp)\n\n 1. Prefer **Q6_K / Q8_0** for style-sensitive rewriting. (Hugging Face)\n 2. Force non-thinking outputs via the template kwargs and validate with a unit test (one request, assert no hidden reasoning text). (Unsloth)\n 3. Use conservative decoding for post-editing:\n\n\n * temperature ~0–0.3 (or equivalent), avoid overly creative sampling\n * keep repetition penalties modest (post-editing often needs repeated terminology)\n\n\n 4. Keep prompts “post-editor shaped”:\n\n\n * explicitly: “Rewrite ONLY the provided translation; do not add content; preserve meaning; follow glossary; output final text only.”\n\n\n\n* * *\n\n## Bottom line\n\n * Your Inference Endpoints failure is consistent with a **Transformers-too-old** situation for the new `qwen3_5` architecture. (Hugging Face)\n * Unsloth GGUF + llama.cpp is a **viable** path for machine-translation post-editing, with the main tradeoff being **quantization + template/runtime behavior**, not a fundamentally different model. (Hugging Face)\n * If you need maximum fidelity + high-throughput serving, aim for a vLLM stack aligned with the now-landed upstream support (Transformers v5.2.0+/post-merge and vLLM builds that match Qwen3.5 guidance). (GitHub)\n\n",
"title": "Seeking Advice: Qwen3.5-27B failing on Inference Endpoints — is Unsloth GGUF a viable alternative for text editing?"
}