Practical match for 128GB Strix Halo with 2x3090s? (inference for coding)

Hugging Face Forums [Unofficial] May 19, 2026

So I rented a server with two 3090s and tried to run some models. I picked a MoE model that doesn't fit in VRAM and gets partially offloaded to the CPU, and a dense one that fits entirely on the GPUs.

Results (output tokens per second):

Qwen3.6-27B-Q8_0 (fits in 3090s):

Halo: 7.8 t/s
2x3090: 24 t/s

gpt-oss-120b-Q4_K_M (does not fit in 3090s, gets offloaded):

Halo: 56 t/s
2x3090: 8.8 t/s
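
To put the gap in relative terms, here is a quick calculation on the four numbers above. The `ratio` helper is just an illustrative name, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many times faster $1 t/s is than $2 t/s.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Dense Qwen3.6-27B-Q8_0: 2x3090 (24 t/s) vs Halo (7.8 t/s)
ratio 24 7.8   # prints 3.1

# MoE gpt-oss-120b-Q4_K_M: Halo (56 t/s) vs 2x3090 (8.8 t/s)
ratio 56 8.8   # prints 6.4
```

So each machine wins one test by a wide margin: the 3090s are about 3x faster on the dense model, and the Halo is about 6x faster on the offloaded MoE model.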

Somehow this experiment did not make the choice clearer. I see people online posting much better results for gpt-oss on 2x3090s, so maybe I just didn’t run it well.

I ran it with:

root@vm6388:~# ./llama.cpp/build2/bin/llama-cli \
  -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -c 128000 \
  -fa on \
  -ngl 23 \
  -sm row \
  -ts 1,1
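
For MoE models that don't fit in VRAM, one commonly suggested setup is to offload all layers with `-ngl` and keep only the expert weights of some layers in system RAM via `--n-cpu-moe`, instead of leaving whole layers on the CPU with a low `-ngl`. This is untested here, so treat it as a sketch. It assumes a llama.cpp build recent enough to have `--n-cpu-moe`. The value 20 is a placeholder to tune, and the `build_cmd` helper is hypothetical: it only prints the command for inspection and does not run it.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints a llama-cli command line, does not execute it.
#   $1 = path to the GGUF model
#   $2 = number of layers whose MoE expert weights stay in system RAM (placeholder, tune it)
build_cmd() {
  local model="$1" n_cpu_moe="$2"
  # -ngl 99: offload every layer to the GPUs
  # --n-cpu-moe: keep expert weights of the first N layers on the CPU (assumes a recent build)
  # -sm layer: split whole layers across the two cards instead of splitting rows
  printf '%s ' ./llama.cpp/build2/bin/llama-cli \
    -m "$model" \
    -c 128000 \
    -fa on \
    -ngl 99 \
    --n-cpu-moe "$n_cpu_moe" \
    -sm layer \
    -ts 1,1
  printf '\n'
}

build_cmd /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf 20
```

The idea would be to start with a high `--n-cpu-moe` value and lower it until both cards are nearly full, checking the t/s after each change.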

Also, since the rental was a VM, I couldn’t see the motherboard or the memory channel count, only the CPU: a Xeon Gold 6246.

I have a feeling that I can replace the Halo with 2x3090s given the right tweaking. Am I right?
