Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibyxesek6xz4lwjupbamosyyrprehnsgszskuzma3n7dda3frdu6q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mm7pj55fiqd2"
  },
  "path": "/t/practical-match-for-128gb-strix-halo-with-2x3090s-inference-for-coding/175977#post_4",
  "publishedAt": "2026-05-19T13:50:58.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "So I rented a server with dual 3090s and tried to run some models. I picked a MoE model that does not fit and gets offloaded, and a dense one that fits.\n\nResults (output tokens per second):\n\nQwen3.6-27B-Q8_0 (fits in 3090s):\n\n- Halo: 7.8 t/s\n\n- 2x3090: 24 t/s\n\ngpt-oss-120b-Q4_K_M (does not fit in 3090s, gets offloaded):\n\n- Halo: 56 t/s\n\n- 2x3090: 8.8 t/s\n\nSomehow this experiment did not make the choice clearer. I see people online posting way better results for gpt-oss on 2x3090s, so maybe I just didn’t run it well.\n\nI ran it with:\n\n\n    root@vm6388:~# ./llama.cpp/build2/bin/llama-cli \\\n      -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \\\n      -c 128000 \\\n      -fa on \\\n      -ngl 23 \\\n      -sm row \\\n      -ts 1,1\n\n\nAlso, since the rental was a VM, I wasn’t able to see the motherboard or memory channel count, only the CPU: a Xeon Gold 6246.\n\nI have a feeling that I can replace the Halo with 2x 3090s with the right tweaking. Am I right?",
  "title": "Practical match for 128GB Strix Halo with 2x3090s? (inference for coding)"
}