Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreighr3q62pbuqsonjvwjb4knrmmji5owi53zz4qxnsvhis464bj4wm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi5ztcit6dr2"
  },
  "path": "/t/500-internal-server-error-with-ollama/174735#post_2",
  "publishedAt": "2026-03-29T00:36:05.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face",
    "Ollama"
  ],
  "textContent": "Maybe an Ollama + Qwen 3.5 series specific issue?\n\n* * *\n\nWhat I would look at first is **model compatibility, not disk contents and not raw VRAM**. Your symptom pattern is: the Hugging Face repo resolves, the GGUF blob downloads, Ollama stores it under a SHA256 blob path, and then the model fails during the **load/init** phase with a local `500 Internal Server Error`. Recent Ollama issues show the same pattern for Hugging Face Qwen3.5 GGUFs, including reports on 0.17.5 and 0.17.6. (GitHub)\n\n## What the error means\n\nThat blob path error does **not** mean “the file is missing.” It means Ollama found the local blob and then failed to initialize it. In other words:\n\n  * download worked\n  * manifest write worked\n  * model open/decode/load failed\n\n\n\nThat distinction matters because it points away from Hugging Face transport problems and toward **runtime incompatibility inside Ollama**. The live Qwen3.5 issue thread in Ollama says HF-downloaded `qwen35/qwen35moe` models can fail because their metadata layout differs from Ollama’s packaged variants, which causes decode failure and then fallback failure. (GitHub)\n\n## Why your exact repo is a higher-risk case\n\nYour target repo is not a simple text-only single-file package. The Hugging Face file tree labels it `Image-Text-to-Text`, includes `mmproj-BF16.gguf`, and ships separate quantized GGUF files like `Q4_K_M`, `Q5_K_M`, `Q6_K`, and `Q8_0`. The `Q8_0` file is listed as 28.6 GB and the projector file is about 931 MB. That means you are dealing with a **multimodal-flavored package** , not just “one plain text GGUF.” (Hugging Face)\n\nThat matters because current Ollama issue reports around HF Qwen3.5 are not just about “large model too big.” They are about **Qwen3.5 architecture handling** and, in neighboring reports, split text/vision style packaging that does not cleanly load through the current HF import path. 
(GitHub)\n\n## Why I do not think your main problem is “two 4090s are not enough”\n\nOllama’s FAQ says that when a model fits on one GPU, it prefers one GPU. If it does not fit on one GPU, it can spread the model across all available GPUs. So dual 4090s are a valid setup for larger models, and your hardware does not immediately point to “this should hard-fail before starting.” (Ollama)\n\nAlso, your failure is happening at the **load step**, not after a long prompt or a giant context window. That makes a metadata or architecture problem more likely than ordinary context-memory pressure. Context settings still matter later, but they are probably not the first thing breaking here. Ollama documents that parallelism and context length scale memory use, and that Flash Attention plus KV-cache quantization are later levers for memory reduction. (Ollama)\n\n## The strongest technical clue\n\nThe most important current Ollama issue for your case is the one stating that HF `qwen35/qwen35moe` models can have a different `attention.head_count_kv` representation from the Ollama library models. The issue says that causes `NewTextProcessor()` to fail during decode, and then the fallback path fails because support there was not merged yet. That is a very direct explanation for “download succeeds, then load fails.” (GitHub)\n\nThere is also a March 2026 duplicate report showing actual loader output that ends with:\n\n`error loading model architecture: unknown model architecture: 'qwen35'`\n\non Ollama 0.17.6, again from a Hugging Face Qwen3.5 GGUF run. That lines up closely with your symptom family. (GitHub)\n\n## One subtle background point\n\nOllama’s official import docs say GGUF import is supported and show both local GGUF import and adapter workflows, but the architecture list on that page names Llama, Mistral, Gemma, and Phi3. At the same time, Ollama’s own model library now has an official `qwen3.5:27b` page whose metadata shows `arch qwen35`. 
That combination suggests something important: **Ollama’s own packaged qwen3.5 models may work before arbitrary HF-imported qwen3.5 GGUFs do**. In other words, “Ollama supports qwen3.5” and “Ollama supports every HF qwen3.5 GGUF repo through `hf.co/...`” are not the same statement. (Ollama)\n\n## What I think is happening in your case\n\nMy ranking would be:\n\n  1. **Most likely:** current Ollama incompatibility with this HF Qwen3.5 GGUF import path. (GitHub)\n  2. **Also likely:** the repo’s multimodal `mmproj` packaging makes the load path more fragile. (Hugging Face)\n  3. **Less likely:** a true VRAM-size problem. (Ollama)\n  4. **Much less likely:** bad filename or failed download, because the loader got far enough to attempt model initialization from the local blob. (GitHub)\n\n\n\n## What to check next\n\n### 1. Look at the actual Ollama server log\n\nOn Ubuntu with systemd, Ollama’s troubleshooting page says to use:\n\n\n    journalctl -u ollama --no-pager --follow --pager-end\n\n\nThat is the single highest-value next step, because it will tell you whether the internal failure is `unknown model architecture: 'qwen35'`, a decode error, or a GPU-init problem. (Ollama)\n\n### 2. Turn on debug logging\n\nOllama’s Linux docs say you can add this systemd override:\n\n\n    [Service]\n    Environment=\"OLLAMA_DEBUG=1\"\n\n\nThen restart the service and re-run the model. That will give you better loader diagnostics than the CLI’s short `500` message. (Ollama)\n\n### 3. Sanity-check Ollama with the official packaged Qwen3.5\n\nTry:\n\n\n    ollama run qwen3.5:27b\n\n\nOllama’s official `qwen3.5:27b` page shows it is a packaged `qwen35` model, updated recently, with a 17 GB Q4_K_M artifact. If this works on your machine, then your CUDA path, service, and multi-GPU environment are probably fine, and the problem is specifically the Hugging Face imported model packaging. (Ollama)\n\n### 4. 
Do not use `Q8_0` as your first test\n\nYour target repo offers `Q4_K_M` at 16.5 GB, `Q5_K_M` at 19.2 GB, `Q6_K` at 22.1 GB, and `Q8_0` at 28.6 GB. Even if size is not the root cause, starting with `Q4_K_M` or `Q5_K_M` removes one extra variable. If the smaller quant fails with the same architecture-style error, that is more evidence that the core problem is compatibility, not memory pressure. (Hugging Face)\n\n### 5. Separate “HF import path” from “GGUF loader path”\n\nOllama’s import docs say you can import a local GGUF via a `Modelfile` using:\n\n\n    FROM /path/to/file.gguf\n\n\nand then build it with `ollama create my-model`. I do **not** expect this to magically fix a true `qwen35` loader incompatibility, because the same runtime still has to parse the model, but it is a useful diagnostic because it removes the HF fetch layer from the equation. (Ollama)\n\n## What not to chase first\n\nI would **not** start by deleting and redownloading the model repeatedly. The evidence points away from a bad transfer and toward a load-time compatibility problem. (GitHub)\n\nI would also **not** spend much time on the broken BF16 multipart discussion in that repo unless you specifically need BF16. Your current attempt is `Q8_0`, which is a different file. There is a repo discussion showing BF16 part 2 problems, but that is a separate issue and probably not your immediate blocker. (Hugging Face)\n\n## My bottom line\n\nYour setup is probably fine. Your command syntax was accepted, the download worked, and the file landed where Ollama stores model blobs. The most likely problem is that **this particular Hugging Face Qwen3.5 GGUF package is hitting a current Ollama loader/runtime gap**, and the repo’s multimodal-style packaging increases the odds of that. 
(GitHub)\n\nSo the best next move is:\n\n  * inspect `journalctl` logs\n  * enable `OLLAMA_DEBUG=1`\n  * test `ollama run qwen3.5:27b`\n  * retry with `Q4_K_M` or `Q5_K_M`\n  * if you need this exact Jackrong model and Ollama still fails, move it to a backend with stronger direct GGUF support\n\n",
  "title": "500 Internal Server Error with Ollama"
}