Seeking Advice: Qwen3.5-27B failing on Inference Endpoints — is Unsloth GGUF a viable alternative for text editing?
Hi everyone, I’m interested in using the Qwen3.5-27B model for machine translation post-editing, which basically involves rewriting pre-provided machine translations to follow specific style and vocabulary guidelines. I tried deploying it on Hugging Face Inference Endpoints using the official Qwen3.5-27B repo with a number of different GPU configurations, including the recommended 2x A100 (160 GB) setup. Every configuration I tried failed with the same error:
Endpoint failed to start | Check Logs
Exit code: 1. Reason: vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1369, in create_engine_config
(APIServer pid=1) model_config = self.create_model_config()
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1223, in create_model_config
(APIServer pid=1) return ModelConfig(
(APIServer pid=1) ^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1) Value error, The checkpoint you are trying to load has model type `qwen3_5` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=1)
(APIServer pid=1) You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=1) For further information visit https://errors.pydantic.dev/2.12/v/value_error
Based on the error, could it be that the vLLM container image used by Inference Endpoints ships with a version of Transformers that doesn’t yet support the qwen3_5 model type? While looking for alternatives, I came across the Unsloth Qwen3.5-27B-GGUF repo, which runs on llama.cpp and deployed without any problems.
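For anyone who wants to test that hypothesis, here is a minimal sketch of how it could be checked, assuming you can open a Python shell in an environment with the same Transformers version as the failing image (I have not confirmed which version that image actually uses). The helper name `supports_model_type` and its `known_types` parameter are made up for illustration; the only library piece it relies on is the `CONFIG_MAPPING_NAMES` table in Transformers, which maps model type strings to config class names.

```python
from importlib import metadata


def supports_model_type(model_type, known_types=None):
    """Return True if `model_type` is in the given table of known model types.

    `known_types` is a hypothetical hook for passing in your own table;
    when omitted, the table shipped with the installed Transformers is used.
    """
    if known_types is None:
        # Imported lazily so the helper can be used without Transformers
        # when a table is supplied explicitly.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
        known_types = CONFIG_MAPPING_NAMES
    return model_type in known_types


if __name__ == "__main__":
    try:
        print("transformers", metadata.version("transformers"))
        print("qwen3_5 supported:", supports_model_type("qwen3_5"))
    except (ImportError, metadata.PackageNotFoundError):
        print("transformers is not installed in this environment")
```

If this reports that qwen3_5 is not supported, that would be consistent with the error message above, though it would not by itself prove that the endpoint image is the cause.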
My question is whether there would be any major differences between the official repo and the Unsloth repo for my use case, and whether there’s any significant difference between vLLM and llama.cpp for it. I’m trying to work out whether I’d be OK to continue with the Unsloth repo or whether I should just wait until the official repo can be deployed.