Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreieeh2ffrlwyyyz3w3jdgg43zmpkmj7odjvanyszfcqzfivduphpn4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmdbmvgvizj2"
  },
  "path": "/t/practical-match-for-128gb-strix-halo-with-2x3090s-inference-for-coding/175977#post_5",
  "publishedAt": "2026-05-21T00:41:22.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "huggingface.co",
    "openai/gpt-oss-120b · Hugging Face",
    "github.com/ggml-org/llama.cpp",
    "docs/multi-gpu.md",
    "master",
    "show original",
    "Performant local mixture-of-experts CPU inference with GPU acceleration in..."
  ],
  "textContent": "Hmm…\n\n* * *\n\nThere are two different questions mixed together here:\n\n  1. Can 2x3090s beat Strix Halo for dense coding models?\n  2. Can 2x3090s replace Strix Halo for `gpt-oss-120b`, especially at large context?\n\n\n\nFor the first one, your own result already says “probably yes”.\n\nFor the second one, I think the current test is not enough to say that.\n\nYour Qwen result looks normal to me:\n\n\n    Qwen3.6-27B-Q8_0\n    Halo:   7.8 tok/s\n    2x3090: 24 tok/s\n\n\nThat is the easy case. A dense-ish model fits in fast NVIDIA VRAM, so the 3090 box wins. I would expect that for many 20B-34B coding models, and probably many 70B Q4 cases too, depending on context/KV.\n\nBut I would not read the `gpt-oss-120b` result the same way:\n\n\n    gpt-oss-120b-Q4_K_M\n    Halo:   56 tok/s\n    2x3090: 8.8 tok/s\n\n\n`gpt-oss-120b` is not a normal dense 120B model. The model card says it is 117B total parameters, but only 5.1B active parameters per token. It is a MoE model, and the MoE weights are MXFP4. It is described as fitting on a single 80GB GPU.\n\nhuggingface.co\n\n### openai/gpt-oss-120b · Hugging Face\n\nThat changes the problem.\n\nFor a dense model, a rough mental model is:\n\n\n    Can I keep most/all weights in fast VRAM?\n    If yes, NVIDIA dGPU probably wins.\n\n\nFor a big MoE model, the mental model is more like:\n\n\n    Which tensors are always active?\n    Which routed experts are active only some of the time?\n    Which parts are on GPU?\n    Which parts are on CPU/system memory?\n    How much KV cache is allocated?\n    How much PCIe traffic happens per token?\n    Does the backend understand this placement well?\n\n\nThat is why I would be cautious here. Strix Halo is not beating 3090 VRAM on raw bandwidth. It is much slower than GDDR6X in that sense. But it has a large unified memory pool. 
For a huge MoE model with offload-like behavior, that can matter more than the simple VRAM bandwidth comparison suggests.\n\nSo I would summarize your current numbers this way:\n\n\n    Dense model that fits in fast VRAM:\n      2x3090 wins hard. Expected.\n\n    Huge MoE model with large memory footprint:\n      not obvious. Halo may be a very good fit.\n\n    Your current 2x3090 gpt-oss result:\n      probably not a fair upper bound yet.\n\n\nThe main reason I would not trust the 8.8 tok/s number as the final answer is the command:\n\n\n    ./llama.cpp/build2/bin/llama-cli \\\n      -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \\\n      -c 128000 \\\n      -fa on \\\n      -ngl 23 \\\n      -sm row \\\n      -ts 1,1\n\n\nSeveral things there make the 2x3090 setup look worse than it might be.\n\nFirst:\n\n\n    -sm row\n\n\nThe current llama.cpp multi-GPU docs describe `row` as deprecated. They describe `layer` as the default and most compatible mode, and `tensor` as experimental but intended for lower token-generation latency where the model/backend/interconnect cooperate.\n\ngithub.com/ggml-org/llama.cpp\n\n#### docs/multi-gpu.md\n\nmaster\n\n\n    # Using multiple GPUs with llama.cpp\n\n    This guide explains how to run [llama.cpp](https://github.com/ggml-org/llama.cpp) across more than one GPU. It covers the split modes, the command-line flags that control them, the limitations you need to know about, and ready-to-use recipes for `llama-cli` and `llama-server`.\n\n    The CLI arguments listed here are the same for both tools - or most llama.cpp binaries for that matter.\n\n    ---\n\n    ## When you need multi-GPU\n\n    Reach for multi-GPU when one of these is true:\n\n    - **The model doesn't fit in a single GPU's VRAM.** By spreading the weights across two or more GPUs the whole model can stay on accelerators. 
Otherwise part of the model will need to be run off of the comparatively slower system RAM.\n    - **You want more throughput.** By distributing the computation across multiple GPUs, each individual GPU has to do less work. This can result in better prefill and/or token generation performance, depending on the split mode and interconnect speed vs. the speed of an individual GPU.\n\n    ---\n\n    ## The split modes\n\n    Set with `--split-mode` / `-sm`.\n\n\nThis file has been truncated.\n\nSo I would not use `row` as the baseline for deciding whether the 2x3090 machine can replace the Halo.\n\nSecond:\n\n\n    -ngl 23\n\n\nThat is not a “try to use as much VRAM as possible” setting. It limits how many layers are offloaded to GPU. For a first baseline I would try:\n\n\n    --n-gpu-layers 999\n\n\nor:\n\n\n    --n-gpu-layers all\n\n\nand only then back down if it does not fit.\n\nThird:\n\n\n    -c 128000\n\n\nThat is a brutal starting point for 48GB total VRAM. It means you are not only testing model throughput. You are also testing huge KV cache pressure, offload behavior, and memory placement all at once.\n\nI would not start at 128k context. I would sweep context size:\n\n\n    -c 8192\n    -c 16384\n    -c 32768\n    -c 65536\n    -c 128000\n\n\nIf the 2x3090 setup is fine at 8k/16k/32k and then collapses at 64k/128k, that tells you something very different from “2x3090 is slow”.\n\nFourth, the rental VM is a big unknown.\n\nFor multi-GPU inference, topology can matter a lot:\n\n\n    PCIe layout\n    P2P availability\n    NCCL availability\n    NUMA placement\n    CPU memory bandwidth\n    virtualization overhead\n\n\nOn a rented VM, you may not know whether the two 3090s are attached in a sane way. 
I would at least check:\n\n\n    nvidia-smi topo -m\n    ./llama-cli --list-devices\n\n\nand watch the llama.cpp logs for things like:\n\n\n    NCCL is unavailable, multi GPU performance will be suboptimal\n\n\nor any sign that much more of the model is on CPU than expected.\n\nThe other important point is that `gpt-oss-120b` should probably be treated as a MoE placement problem, not just a normal `-ngl` problem.\n\nThis guide explains the general idea well:\n\nhuggingface.co\n\n### Performant local mixture-of-experts CPU inference with GPU acceleration in...\n\nA Blog post by Doctor Shotgun on Hugging Face\n\nFor MoE offload, the interesting idea is:\n\n\n    Always-active tensors:\n      use them every token\n      highest priority to keep on GPU\n\n    Routed experts:\n      huge part of the model\n      only a subset used per token\n      may be more reasonable to offload partly to CPU/system memory\n\n\nSo for `gpt-oss-120b`, I would test MoE-aware options if your llama.cpp build supports them:\n\n\n    --cpu-moe\n\n\nor sweep:\n\n\n    --n-cpu-moe 32\n    --n-cpu-moe 30\n    --n-cpu-moe 28\n    --n-cpu-moe 26\n    --n-cpu-moe 24\n\n\nI would not assume those are the correct values. 
I would sweep and measure.\n\nA more useful first baseline might look like this:\n\n\n    ./llama-cli \\\n      -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \\\n      -c 8192 \\\n      -fa on \\\n      --n-gpu-layers 999 \\\n      --split-mode layer \\\n      --tensor-split 1,1\n\n\nThen test MoE placement:\n\n\n    ./llama-cli \\\n      -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \\\n      -c 8192 \\\n      -fa on \\\n      --n-gpu-layers 999 \\\n      --split-mode layer \\\n      --tensor-split 1,1 \\\n      --n-cpu-moe 28\n\n\nThen sweep:\n\n\n    --n-cpu-moe 32\n    --n-cpu-moe 30\n    --n-cpu-moe 28\n    --n-cpu-moe 26\n    --n-cpu-moe 24\n\n\nThen raise context:\n\n\n    -c 8192\n    -c 16384\n    -c 32768\n    -c 65536\n    -c 128000\n\n\nIf `tensor` split is supported for this model/build, I would test it too, but separately:\n\n\n    --split-mode tensor\n\n\nI would not mix all variables at once.\n\nMy preferred order would be:\n\n\n    1. layer split, small context, max GPU layers\n    2. layer split, small context, MoE offload sweep\n    3. layer split, context sweep\n    4. tensor split, small context\n    5. tensor split, MoE offload sweep\n    6. tensor split, context sweep\n\n\nAnd I would measure prompt processing and token generation separately.\n\nFor coding use, both matter:\n\n\n    Prompt processing:\n      big prompts, repo snippets, logs, RAG, long context\n\n    Token generation:\n      how fast the answer streams back\n\n\nIf one setup has great token generation but terrible prompt processing, it may feel bad for coding with long files. 
If another setup has slower generation but handles large context smoothly, it may be better for actual work.\n\nSo I would benchmark something like:\n\n\n    same prompt\n    same context size\n    same output token count\n    same sampling\n    same llama.cpp commit\n    same quant\n    same backend\n    same command except the variable under test\n\n\nFor example:\n\n\n    # 8k baseline\n    -c 8192\n\n    # 32k realistic coding/RAG-ish baseline\n    -c 32768\n\n    # 128k stress test\n    -c 128000\n\n\nI would also separate these model classes:\n\n\n    Class A:\n      dense model that fits in one 3090\n\n    Class B:\n      dense model that needs both 3090s but mostly stays in VRAM\n\n    Class C:\n      huge MoE / offload-heavy / high-context model\n\n\nYour results already show why this matters.\n\nFor Class A / B, 2x3090 is likely excellent.\n\nFor Class C, Strix Halo may be surprisingly good.\n\nThat is the real point here. The systems are not exact substitutes.\n\nThe 3090 box is basically:\n\n\n    fast VRAM\n    CUDA ecosystem\n    great for dense models\n    great for gaming\n    great for smaller/faster coding models\n    annoying when the model exceeds VRAM\n    multi-GPU complexity when using both cards\n\n\nThe Halo box is basically:\n\n\n    much larger unified memory\n    less raw bandwidth than 3090 VRAM\n    much easier for very large models\n    possibly very good for MoE/offload-heavy workloads\n    not as strong for dense models that fit in NVIDIA VRAM\n\n\nThat matches your numbers.\n\nSo if your real daily workload is:\n\n\n    Qwen / Llama / DeepSeek Coder style dense models\n    20B-34B\n    70B Q4\n    gaming\n    CUDA tools\n\n\nthen I would lean toward the 3090 box.\n\nIf your real daily workload is:\n\n\n    gpt-oss-120b\n    large context\n    MoE experiments\n    \"make the huge model fit without fighting placement all day\"\n\n\nthen I would keep the Halo unless a tuned 2x3090 test proves otherwise.\n\nIf you can keep both, I 
think the cleanest split is:\n\n\n    Halo:\n      gpt-oss-120b\n      huge context\n      MoE/offload experiments\n      large-memory local inference\n\n    3090 box:\n      dense coding models\n      fast small/medium models\n      CUDA backends\n      gaming\n\n\nIf you want to replace the Halo, I would want the 2x3090 box to pass a more controlled `gpt-oss-120b` test first.\n\nSomething like:\n\n\n    No row split\n    Smaller context first\n    --n-gpu-layers 999/all\n    MoE offload sweep\n    Topology check\n    Prompt processing and generation measured separately\n    Same prompt / output length / sampling\n    Bare metal if possible, not unknown rental VM topology\n\n\nIf after that the 2x3090 setup gets close to or beats the Halo in your real `gpt-oss-120b` use case, then replacing the Halo becomes much more reasonable.\n\nIf it still loses badly, then I would not treat that as a surprise. It would just mean that `gpt-oss-120b` is landing in the exact niche where Strix Halo’s large unified memory is useful.\n\nOne more way to phrase it:\n\n\n    2x3090 is probably the better dense-model machine.\n\n    Strix Halo may be the better \"large weird model\" machine.\n\n    gpt-oss-120b is a large weird model.\n\n\nSo my answer to your original question would be:\n\n**Yes for many coding models, not proven for gpt-oss-120b.**\n\nAnd based on the numbers you posted, I would not sell the Halo yet.\n",
  "title": "Practical match for 128Gb Strix Halo with 2x3090s? (inference for coding)"
}