Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibd2c24wgt2nkcvtnfpvjgmfximhbf2wkf5eki2quw5ncuz5y4dye",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mn2mkf5cd6y2"
  },
  "path": "/t/what-should-i-change-to-optimize-local-hosted-ai/176339#post_2",
  "publishedAt": "2026-05-30T06:48:20.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "llama.cpp server options",
    "llama.cpp SYCL backend docs",
    "B70 Q8_0 quantization ~4× slower than Q4_K_M — kernel efficiency issue",
    "Qwen3.6 server-intel regression: b9144 32.8 t/s vs b9159 25.8 t/s",
    "Qwen3.6-27B MTP on SYCL: truncated thinking / OOM / garbled output",
    "SYCL MTP on Intel Arc: correct output but no speed gain over baseline",
    "MTP acceptance-rate collapse at specific context sizes",
    "MTP -fit on VRAM accounting issue",
    "Qwen3.6 / server full prompt re-processing issue",
    "Continue autocomplete docs",
    "Continue model roles: Chat/Edit/Apply/Autocomplete/Embedding/Reranker",
    "SYCL multi-GPU GTT mirror issue",
    "server option docs",
    "Q8_0 being ~4× slower than Q4_K_M on Arc Pro B70",
    "B70 Q8_0 kernel efficiency issue",
    "llama.cpp server README",
    "Qwen3.6 on server-intel: b9144 32.8 t/s, b9159 25.8 t/s",
    "brutally bad SYCL performance on Battlemage",
    "Qwen3.6-27B MTP on SYCL: truncated thinking, OOM, garbled output",
    "SYCL MTP on Intel Arc: correct output but no speed gain",
    "-fit on ignores VRAM needed by built-in MTP draft context",
    "MTP speed decrease after cleanup/merge",
    "one issue",
    "Qwen3.6 27B full prompt re-processing / cache behavior",
    "server forces full prompt re-processing on subsequent prompts",
    "Continue model roles"
  ],
  "textContent": "Hmm… Based on known community findings, there do seem to be several settings worth improving:\n\n* * *\n\nYour setup is not obviously wrong — Ubuntu + llama.cpp/SYCL + 2× Arc Pro B70 + Qwen3.6-27B is a reasonable direction for local coding agents. However, Intel Arc/Battlemage + llama.cpp/SYCL + Qwen3.6 has several known performance traps, so I would first build a clean baseline and then change one variable at a time.\n\nThe most important distinction is:\n\n  * **prefill / prompt processing** : shown as `prompt eval time`; very important for coding agents because they repeatedly send repo context, diffs, tool results, logs, etc.\n  * **decode / generation** : shown as `eval time`; important for normal token-by-token response speed\n  * **stability / memory behavior** : especially important with multi-GPU SYCL, MTP, large context, and long-running agents\n\n\n\nRelevant upstream docs/issues:\n\n  * llama.cpp server options\n  * llama.cpp SYCL backend docs\n  * B70 Q8_0 quantization ~4× slower than Q4_K_M — kernel efficiency issue\n  * Qwen3.6 server-intel regression: b9144 32.8 t/s vs b9159 25.8 t/s\n  * Qwen3.6-27B MTP on SYCL: truncated thinking / OOM / garbled output\n  * SYCL MTP on Intel Arc: correct output but no speed gain over baseline\n  * MTP acceptance-rate collapse at specific context sizes\n  * MTP -fit on VRAM accounting issue\n  * Qwen3.6 / server full prompt re-processing issue\n  * Continue autocomplete docs\n  * Continue model roles: Chat/Edit/Apply/Autocomplete/Embedding/Reranker\n\n\n\n## TL;DR: highest-value things I would test first\n\nArea | Current-looking setting | What I would test | Why\n---|---|---|---\nMeasurement | no explicit perf logging | add `--perf` | Without `prompt eval` vs `eval`, tuning is guesswork\nGPU split | `--split-mode layer --tensor-split 1,1` | try `--split-mode none --main-gpu 0` | 27B Q5 may fit on one B70; dual-GPU split may add latency\nThreads | `--threads 24` | try `--threads 8 --threads-batch 
16` | GPU-offloaded inference often does not benefit from many CPU threads\nNUMA | `--numa distribute` | remove it first | likely not useful on a normal single-socket workstation\nKV cache | `q8_0/q8_0` | compare `f16`, `q8_0`, `q4_0` | Arc/B70 quant paths can behave very differently\nQuant | Q5_K_M only | compare Q4_K_M vs Q5_K_M | Q4 may be much better latency on B70\nFlash Attention | unspecified/auto | test `-fa on` vs `auto` | often relevant for long-context workloads\nBuild | moving target? | pin/compare builds | known `server-intel-b9159` regression exists\nContinue | one big model for everything? | split autocomplete to smaller model | autocomplete should be latency-optimized\nMTP | tempting | leave it for later | Qwen3.6 MTP + SYCL still has sharp edges\n\n## 1. Add `--perf` before changing anything\n\nFirst, keep your current command but add:\n\n\n    --perf\n\n\nThen look for:\n\n\n    prompt eval time\n    eval time\n    tokens per second\n\n\nInterpretation:\n\n  * slow `prompt eval` = context/prefill problem\n  * slow `eval` = generation/quant/backend/split problem\n  * slow in Continue but not in `llama-bench` = likely agentic-context or client-side request pattern problem\n\n\n\nFor coding agents, `prompt eval` is often the hidden bottleneck. A model can look fine on short prompts or `tg128`, but feel bad in Continue because every agent step re-sends large context.\n\n## 2. 
Test single GPU before dual-GPU layer split\n\nYour current-style setup appears to use:\n\n\n    --split-mode layer \\\n    --tensor-split 1,1\n\n\nI would absolutely compare that with single-GPU mode:\n\n\n    --split-mode none \\\n    --main-gpu 0\n\n\nOptionally also pin the SYCL device:\n\n\n    export ONEAPI_DEVICE_SELECTOR=level_zero:0\n\n\nWhy this may help:\n\n  * a 27B Q5_K_M model may fit on a single 32 GB B70\n  * layer split helps capacity, but does not guarantee better single-user latency\n  * decode often does not scale well across multiple GPUs\n  * multi-GPU SYCL may increase host-memory pressure\n  * Intel multi-GPU SYCL has a known host-side GTT mirror behavior: SYCL multi-GPU GTT mirror issue\n\n\n\nIf single GPU is faster or similarly fast, I would use the second B70 for another service instead:\n\n\n    GPU 0: Qwen3.6-27B for chat/edit/agent\n    GPU 1: Qwen2.5-Coder 1.5B/7B for autocomplete, or embeddings/reranking\n\n\nFor a single-user coding workstation, two independent services can feel better than one model split over two GPUs.\n\n## 3. Reduce CPU threads and set batch threads separately\n\nIf you currently use:\n\n\n    --threads 24\n\n\nI would compare:\n\n\n    --threads 8 \\\n    --threads-batch 16\n\n\nand:\n\n\n    --threads 4 \\\n    --threads-batch 16\n\n\nand:\n\n\n    --threads 8 \\\n    --threads-batch 8\n\n\n`--threads` and `--threads-batch` are separate llama.cpp server knobs. `--threads` is more relevant to generation-side CPU work, while `--threads-batch` matters for batch/prompt processing. See the official server option docs.\n\nWith most layers on GPU, more CPU threads are not always better. Too many threads can add scheduling overhead or just not help. For coding agents, `--threads-batch` can matter more because large prompt ingestion is common.\n\n## 4. 
Remove `--numa distribute` unless this is really a NUMA machine\n\nIf this is a normal single-socket desktop/workstation system, I would remove:\n\n\n    --numa distribute\n\n\nBaseline should probably be no NUMA setting. Only test NUMA modes later if you know the machine is actually NUMA-relevant.\n\n## 5. Do not assume `q8_0` KV cache is fastest\n\nYour command uses:\n\n\n    --cache-type-k q8_0 \\\n    --cache-type-v q8_0\n\n\nThat may be good for VRAM, but it should be measured. Compare:\n\n\n    --cache-type-k f16 \\\n    --cache-type-v f16\n\n\n\n    --cache-type-k q8_0 \\\n    --cache-type-v q8_0\n\n\n\n    --cache-type-k q4_0 \\\n    --cache-type-v q4_0\n\n\nThe B70-specific reason to test this is that quantized paths on Battlemage can behave surprisingly. The clearest known example is Q8_0 being ~4× slower than Q4_K_M on Arc Pro B70. That issue is about model weights, not KV cache, so it does **not** prove `q8_0` KV is bad. But it does prove that on B70, “higher bit = safer/faster” is not a reliable assumption.\n\nThe same issue also notes that `-DGGML_SYCL_F16=ON` improved prompt processing by about 2.4× in one Q4_K_M case, while not improving token generation. That is another clue that prefill and decode must be tuned separately.\n\n## 6. Test Q4_K_M against Q5_K_M\n\nQ5_K_M is reasonable, but for local coding latency I would compare:\n\n\n    Qwen3.6-27B-Q4_K_M\n    Qwen3.6-27B-Q5_K_M\n    Qwen3.6-27B-Q6_K\n\n\nSuggested order:\n\n  1. Q4_K_M baseline\n  2. Q5_K_M quality comparison\n  3. Q6_K only if you still have enough speed/VRAM\n\n\n\nOn B70, Q4_K_M may be a better practical latency/quality point than Q5_K_M. The B70 Q8_0 issue is the strongest warning that quant performance on this architecture is not always intuitive: B70 Q8_0 kernel efficiency issue.\n\n## 7. 
Explicitly test Flash Attention\n\nTry:\n\n\n    -fa on\n\n\nand compare with:\n\n\n    -fa auto\n\n\nand maybe:\n\n\n    -fa off\n\n\nFor long-context coding workloads, Flash Attention can matter, but it should still be measured. The option is documented in the llama.cpp server README.\n\n## 8. Prefer `--n-gpu-layers all` over `999`\n\nIf the intention is “offload everything possible,” use:\n\n\n    --n-gpu-layers all\n\n\ninstead of:\n\n\n    --n-gpu-layers 999\n\n\nThis is mostly clarity, not a guaranteed performance change. The server docs support `auto`, `all`, and numeric values.\n\n## 9. Pin builds; avoid moving `latest`\n\nThere is a very relevant Intel build regression report:\n\n  * Qwen3.6 on server-intel: b9144 32.8 t/s, b9159 25.8 t/s\n\n\n\nThat issue is Qwen3.6-35B-A3B-MTP on Arc Pro B50, not exactly your 27B dense setup, so it is not proof that your setup is affected. But it is close enough to justify build pinning and A/B testing.\n\nRecord:\n\n\n    ./build/bin/llama-server --version\n    sycl-ls\n    uname -a\n\n\nIf using Docker, compare pinned images rather than `latest`:\n\n\n    ghcr.io/ggml-org/llama.cpp:server-intel-b9144\n    ghcr.io/ggml-org/llama.cpp:server-intel-b9159\n\n\nWith Intel Arc + SYCL, performance can depend on:\n\n  * llama.cpp commit\n  * Intel compute-runtime\n  * oneAPI version\n  * Linux kernel / driver stack\n  * Docker image contents\n  * whether `i915` or `xe` is used\n  * ReBAR / Above 4G / PCIe platform behavior\n\n\n\n## 10. 
Check platform basics: ReBAR, PCIe, driver stack\n\nIf performance is much lower than other B70 reports, I would verify platform-level things too.\n\nUseful checks:\n\n\n    sudo lspci -vv | grep -i -E \"Resizable BAR|Region|prefetchable\" -A3\n    lspci -nnk | grep -i -E \"VGA|Display|3D\" -A4\n    sycl-ls\n    uname -a\n\n\nThings to confirm:\n\n\n    Above 4G Decoding: enabled\n    Resizable BAR: enabled\n    PCIe link width/speed: expected width/speed\n    driver stack: i915 vs xe\n    Intel compute-runtime version\n    oneAPI version\n    kernel version\n\n\nThere is also a relevant report of very poor SYCL performance on an older DDR4 / PCIe 3.0 platform with Battlemage: brutally bad SYCL performance on Battlemage. Your system sounds much newer, but it is still worth verifying ReBAR/PCIe/driver basics.\n\n## 11. Treat MTP as a later experiment, not the first fix\n\nQwen3.6 MTP is interesting, but I would not add it until the non-MTP baseline is clean.\n\nRelevant issues:\n\n  * Qwen3.6-27B MTP on SYCL: truncated thinking, OOM, garbled output\n  * SYCL MTP on Intel Arc: correct output but no speed gain\n  * MTP acceptance-rate collapse at specific context sizes\n  * -fit on ignores VRAM needed by built-in MTP draft context\n  * MTP speed decrease after cleanup/merge\n\n\n\nIf you test MTP later, start conservatively:\n\n\n    --parallel 1 \\\n    --spec-type draft-mtp \\\n    --spec-draft-n-max 2\n\n\nAvoid assuming MTP is faster just because draft acceptance is high. On SYCL/Intel Arc, one issue specifically reports correct output but no speed gain, with per-kernel dispatch overhead identified as the remaining bottleneck.\n\nAlso watch for:\n\n\n    draft acceptance rate\n    VRAM before/after requests\n    forcing full prompt re-processing\n    create context checkpoint\n    OOM after multiple requests\n\n\n## 12. 
Watch for full prompt re-processing\n\nFor Qwen3.6 and agentic use, look for:\n\n\n    forcing full prompt re-processing\n\n\nRelevant issue:\n\n  * Qwen3.6 27B full prompt re-processing / cache behavior\n  * Related generic server prompt-cache issue: server forces full prompt re-processing on subsequent prompts\n\n\n\nIf this appears often, the problem may not be raw GPU throughput. It may be prompt cache invalidation, slot reuse behavior, hybrid attention/recurrent-memory behavior, client request shape, or MTP interaction.\n\nFor a single local coding-agent user, I would initially force:\n\n\n    --parallel 1\n\n\nand only increase parallelism after the baseline is stable.\n\n## 13. Continue.dev: separate autocomplete from chat/agent\n\nContinue has different model roles: Chat, Edit, Apply, Autocomplete, Embedding, Reranker, etc. See Continue model roles.\n\nFor autocomplete specifically, Continue recommends smaller/faster models such as Qwen Coder 2.5 1.5B or 7B: Continue autocomplete docs. The docs also note that thinking-type models are generally not recommended for autocomplete because they generate more slowly.\n\nA practical split:\n\n\n    localhost:8080 -> Qwen3.6-27B-Q4_K_M or Q5_K_M for chat/edit/agent\n    localhost:8081 -> Qwen2.5-Coder-1.5B or 7B for autocomplete\n\n\nThis can improve perceived responsiveness a lot. Autocomplete should not be queued behind a large 27B agent request.\n\n## 14. 
Suggested clean baseline\n\nI would start with something like this:\n\n\n    #!/bin/bash\n    source /opt/intel/oneapi/setvars.sh --force\n\n    export ZES_ENABLE_SYSMAN=1\n    export ONEAPI_DEVICE_SELECTOR=level_zero:0\n    export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1\n\n    cd ~/llama.cpp\n\n    ./build/bin/llama-server \\\n      -m ~/models/Qwen3.6-27B-Q4_K_M.gguf \\\n      -a Roboto \\\n      -c 32768 \\\n      -fa on \\\n      --cache-type-k f16 \\\n      --cache-type-v f16 \\\n      --n-gpu-layers all \\\n      -b 2048 \\\n      -ub 512 \\\n      --threads 8 \\\n      --threads-batch 16 \\\n      --host 0.0.0.0 \\\n      --port 8080 \\\n      --split-mode none \\\n      --main-gpu 0 \\\n      --parallel 1 \\\n      --perf\n\n\nThis is not guaranteed to be best. It is just a cleaner baseline:\n\n  * one GPU\n  * no NUMA complication\n  * explicit Flash Attention\n  * explicit KV type\n  * explicit `threads` vs `threads-batch`\n  * explicit `parallel 1`\n  * perf logging\n  * Q4_K_M as a latency-first starting point\n\n\n\nThen change only one thing at a time.\n\n## 15. 
Suggested A/B order\n\n### Step 0: current setup + perf\n\nAdd only:\n\n\n    --perf\n\n\nSave the logs.\n\n### Step 1: single GPU\n\nChange:\n\n\n    --split-mode layer \\\n    --tensor-split 1,1\n\n\nto:\n\n\n    --split-mode none \\\n    --main-gpu 0\n\n\nIf this is faster, dual-GPU split is probably not helping latency.\n\n### Step 2: threads\n\nTry:\n\n\n    --threads 8 \\\n    --threads-batch 16\n\n\ninstead of:\n\n\n    --threads 24\n\n\n### Step 3: remove NUMA\n\nRemove:\n\n\n    --numa distribute\n\n\n### Step 4: KV cache\n\nCompare:\n\n\n    --cache-type-k f16 --cache-type-v f16\n    --cache-type-k q8_0 --cache-type-v q8_0\n    --cache-type-k q4_0 --cache-type-v q4_0\n\n\n### Step 5: quant\n\nCompare:\n\n\n    Qwen3.6-27B-Q4_K_M\n    Qwen3.6-27B-Q5_K_M\n\n\n### Step 6: Flash Attention\n\nCompare:\n\n\n    -fa on\n    -fa auto\n    -fa off\n\n\n### Step 7: context size\n\nCompare:\n\n\n    -c 16384\n    -c 24576\n    -c 32768\n\n\nBigger context is not free. For coding agents, having 32K available is useful, but repeatedly filling it can make the system feel slow.\n\n### Step 8: pinned builds\n\nCompare known builds / commits, especially if using Docker or recent server-intel images.\n\n### Step 9: only then try MTP\n\nOnly after non-MTP is stable, try:\n\n\n    --parallel 1 \\\n    --spec-type draft-mtp \\\n    --spec-draft-n-max 2\n\n\nand compare it against non-MTP.\n\n## 16. 
What I would put in the log comparison table\n\nSomething like this:\n\n\n    Config name:\n    Model:\n    Quant:\n    Backend:\n    llama.cpp version:\n    Intel compute-runtime:\n    oneAPI:\n    Kernel:\n    Driver stack:\n    GPU split:\n    KV type:\n    Context size:\n    Batch / ubatch:\n    Threads / threads-batch:\n    Flash Attention:\n    Parallel slots:\n\n    prompt eval t/s:\n    eval t/s:\n    VRAM used:\n    System RAM:\n    GTT mirror, if checked:\n    Notes:\n\n\nFor multi-GPU SYCL memory behavior, this may help:\n\n\n    PID=$(pgrep llama-server)\n\n    for fd in /proc/$PID/fdinfo/*; do\n      grep -H \"drm-total-gtt\\|drm-total-vram\" \"$fd\" 2>/dev/null\n    done\n\n\n## 17. My best guess\n\nMy guess is that the biggest practical wins will come from:\n\n  1. **single-GPU baseline instead of dual-GPU layer split**\n  2. **lower CPU thread count with explicit `--threads-batch`**\n  3. **removing `--numa distribute`**\n  4. **testing Q4_K_M vs Q5_K_M**\n  5. **testing KV `f16` vs `q8_0`**\n  6. **pinning known-good llama.cpp / server-intel builds**\n  7. **separating Continue autocomplete onto a smaller model**\n\n\n\nI would not start by chasing MTP, huge context sizes, or experimental split modes. First make the normal Qwen3.6-27B path fast and reproducible.",
  "title": "What should i change to optimize local hosted AI"
}