{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigadnc2jopedfap4qanr725aeps3363w7jhlxgstokqfphqylpb4e",
"uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mnrssk2dbu22"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreidlfxj2opwpkrbvpnvkfglsppttvxwszitwjmuyyrn2mg2ko3jr6u"
},
"mimeType": "image/png",
"size": 601296
},
"description": "A field report from getting a local LLM inference endpoint working on an NVIDIA DGX Spark (GB10 / SM121, 128 GB unified memory) — including every wall I hit with gpt-oss-120B, why a smaller FP8 model sidestepped all of them, and how to expose the result safely through an nginx reverse proxy on a multihomed server.\n\nTL;DR: On a GB10 Spark, the quantization format matters more than raw capability. gpt-oss-120B ships in MXFP4, which has no native hardware support on SM121 and runs through fragile s",
"path": "/why-qwen3-6-35b-runs-on-a-nvidia-dgx-spark-and-gpt-oss-120b-fought-me-every-step/",
"publishedAt": "2026-06-08T13:43:06.000Z",
"site": "https://corti.com",
"textContent": "A field report from getting a local LLM inference endpoint working on an NVIDIA DGX Spark (GB10 / SM121, 128 GB unified memory) — including every wall I hit with gpt-oss-120B, why a smaller FP8 model sidestepped all of them, and how to expose the result safely through an nginx reverse proxy on a multihomed server.\n\n**TL;DR:** On a GB10 Spark, the quantization format matters more than raw capability. gpt-oss-120B ships in MXFP4, which has no native hardware support on SM121 and runs through fragile software kernel paths; combined with the Spark's unified memory, that produced a cascade of freezes and crashes. Qwen3.6-35B-A3B in FP8 — smaller, mixture-of-experts, and on a well-supported kernel path — loaded and served cleanly on the first honest attempt.\n\n* * *\n\n## The hardware, and the two traps it sets\n\nThe DGX Spark is a GB10 Grace Blackwell machine with 128 GB of **unified** memory shared between CPU and GPU. Two architectural facts shaped everything that followed:\n\n 1. **Unified memory is shared.** vLLM's `--gpu-memory-utilization` is a fraction of the _entire_ 128 GB pool, not a separate VRAM budget. The default is `0.9`. On a discrete GPU that only touches VRAM; here it starves the host OS.\n 2. **SM121 has no native FP4.** Blackwell-class GB10 runs FP4 weights through software decompression kernels (Marlin/CUTLASS paths). For MXFP4 models like gpt-oss, those paths are immature and version-sensitive.\n\n\n\nNeither is obvious until you trip over it. I tripped over both.\n\n* * *\n\n## The gpt-oss-120B saga\n\n### Wall 1 — the host froze at the default memory setting\n\nThe first bare `vllm serve openai/gpt-oss-120b` reserved ~90% of the unified pool (~115 GB), leaving the kernel, Docker, and sshd to fight over the remaining ~13 GB. The box stopped responding to SSH while still answering ping — classic memory starvation, not a crash. The fix is to leave the host real headroom: `--gpu-memory-utilization 0.70` (~26 GB free for the OS). 
On unified memory you _never_ run the 0.9 default.\n\n### Wall 2 — \"it loaded\" is not \"it serves\"\n\nWith memory tamed, the model loaded and idled happily at ~74 GB used. Then the first inference request wedged the entire host. Loading and serving are different phases with different failure modes, and the first decode is where the GB10-specific kernel problems actually bite.\n\n### Wall 3 — the MXFP4-on-SM121 problem (the real one)\n\nThis is the crux. gpt-oss-120B's weights are MXFP4, and on SM121 vLLM's default backend selection lands on a kernel path that hangs or crashes on first decode. The community has converged on workarounds, but they're entangled with a specific _patched_ build of vLLM + FlashInfer. On the stock NVIDIA NGC container, those workarounds don't all apply, which produced a string of secondary failures:\n\n * `unrecognized arguments: --mxfp4-layers` — that flag exists only in the patched build; stock vLLM 0.21.0 rejects it.\n * `FLASHINFER ... attention sinks not supported` — gpt-oss uses attention sinks, and the stock container's FlashInfer can't do them, so forcing that backend aborted load. (The patched build compiles its own FlashInfer that can.)\n * `Unknown vLLM environment variable: VLLM_MXFP4_BACKEND` — the marlin-backend env var simply isn't read by this build.\n\n\n\nEach \"fix\" from a recipe written against the patched build was a flag the stock container didn't understand.\n\n### Wall 4 — the memory spike the budget doesn't count\n\nOnce the flags were stripped back to what stock vLLM accepts, the model loaded and idled at ~74 GB with ~46 GB free — stable. Then the first request did this:\n\n\n 18:55:56 used 76.3G avail 45.4G\n 18:55:58 used 78.0G avail 43.7G\n 18:56:00 used 89.8G avail 31.9G\n 18:56:02 used 110.3G avail 11.4G\n 18:56:04 used 121.3G avail 0.4G <== host starved\n\n\nA ~47 GB spike on top of the resident model, in six seconds. 
The cause: **CUDA graph capture plus torch.compile firing on the first forward pass** — and that memory is _not_ counted against `--gpu-memory-utilization`. So 0.70 left 46 GB headroom, the spike wanted more, and the host died. The lever that kills it is `--enforce-eager`, which disables graph capture and compilation (at a real throughput cost). That's the trade I'd make to get 120B stable on the stock container — but by this point the smarter move was a different model.\n\n### A debugging aside: watch `available`, not `free`, and watch from elsewhere\n\nTwo habits saved time. First, the memory metric that matters on Linux is **available**, not **free** — during heavy file reads `free` drops toward zero as the page cache fills, while `available` (which counts reclaimable cache) stays healthy. Misreading `free` as \"almost out of memory\" sends you chasing ghosts. Second, **run your monitoring and test client from a different machine**. I was curling the endpoint over SSH _on the Spark itself_, so when it froze, my client and my shell died with it. A laptop-side memory watcher that streams `free`/avail and reconnects on drop turns \"it froze\" into a timestamped, observable event.\n\n* * *\n\n## Why Qwen3.6-35B-A3B-FP8 just worked\n\nSwitching to Qwen3.6-35B-A3B-FP8 removed every one of those failure classes at once, for three structural reasons:\n\n * **FP8, not MXFP4.** FP8 runs on a well-supported kernel path on SM121; vLLM auto-selects a working MoE backend and the model just loads. None of the Marlin/CUTLASS/FlashInfer-sinks drama applies.\n * **It fits with room to spare.** At ~35 GB of weights against 121 GB, even the CUDA-graph capture spike fits inside the headroom — so there's no first-inference freeze, and you don't even need `--enforce-eager`.\n * **It's a fast MoE.** 35B total but only ~3B active parameters per token, so on the bandwidth-bound Spark it decodes quickly for its quality. 
Benchmarks on Spark report roughly 28–30 tok/s single-stream, scaling to ~150+ tok/s aggregate under concurrency.\n\n_Figure: Qwen/Qwen3.6-35B-A3B-FP8 reasoning about binary sorting algorithm._\n\nThe lesson generalizes: on GB10, prefer FP8 (or a quantization with a mature SM121 kernel) over MXFP4, and prefer a model that fits comfortably over one that maxes the unified pool. A 35B FP8 MoE is a far better daily driver here than a 120B MXFP4 model that needs a patched stack and an eager-mode throughput penalty just to stay upright.\n\n_Figure: NVIDIA DGX Dashboard for Spark while inferencing._\n\n* * *\n\n## The working setup\n\n### `.env` — secrets, kept out of the compose file\n\nTwo secrets: the Hugging Face token (a **read** token is enough — you're only downloading) and the vLLM API key (the bearer token clients must present). Keep them in a `.env` beside the compose file; Docker Compose auto-loads it for `${VAR}` substitution.\n\n\n cd ~/docker/vllm/qwen36\n cat > .env <<'EOF'\n HF_TOKEN=hf_your_read_token\n VLLM_API_KEY=sk-replace-with-a-strong-key\n EOF\n chmod 600 .env\n echo '.env' >> .gitignore # never commit it\n\n\nGenerate a strong API key with `echo \"sk-$(openssl rand -hex 32)\"`.\n\n### Fetch the model up front, then share it into the container\n\nDon't let the first `vllm serve` do a multi-gigabyte download as part of startup — stage it once, then mount the cache into the container (\"download once, mount everywhere\"). 
On a managed Ubuntu/DGX OS box, install the CLI in isolation (system Python is externally managed):\n\n\n sudo apt install -y pipx && pipx ensurepath\n pipx install \"huggingface_hub[cli]\"\n pipx inject huggingface_hub hf_transfer # faster large downloads\n export HF_HUB_ENABLE_HF_TRANSFER=1\n\n hf auth login # paste the read token\n hf download Qwen/Qwen3.6-35B-A3B-FP8 # lands in ~/.cache/huggingface\n\n\nRun large downloads inside `tmux` so they survive a dropped SSH session (downloads are resumable — re-running `hf download` continues where it left off).\n\nThe sharing mechanism is a single volume mount: bind the host cache to the container's cache path. vLLM then finds the weights locally and starts fast, with no network fetch at serve time:\n\n\n volumes:\n   - ~/.cache/huggingface:/root/.cache/huggingface\n\n\n### `compose.yml`\n\n\n services:\n   vllm:\n     image: nvcr.io/nvidia/vllm:26.05.post1-py3\n     container_name: vllm-qwen36\n     gpus: all\n     network_mode: host\n     ipc: host\n     shm_size: \"16gb\"\n     environment:\n       - HF_TOKEN=${HF_TOKEN}\n       - VLLM_API_KEY=${VLLM_API_KEY} # bearer token clients must send\n     volumes:\n       - ~/.cache/huggingface:/root/.cache/huggingface\n     command: >\n       vllm serve Qwen/Qwen3.6-35B-A3B-FP8\n       --host 0.0.0.0\n       --port 8000\n       --tensor-parallel-size 1\n       --gpu-memory-utilization 0.70\n       --max-model-len 32768\n       --kv-cache-dtype fp8\n       --max-num-batched-tokens 8192\n       --enable-prefix-caching\n       --trust-remote-code\n       --enable-auto-tool-choice\n       --tool-call-parser qwen3_coder\n       --reasoning-parser qwen3\n     restart: \"no\" # flip to unless-stopped once you trust it\n\n\nNotes that matter:\n\n * **No `--quantization` flag** — FP8 is auto-detected from the repo. (Passing MXFP4-specific flags here is what broke gpt-oss on stock vLLM.)\n * **No `--enforce-eager`** — the model is small enough that CUDA graphs fit, so you keep full speed. 
Only add it back if memory climbs on first inference.\n * **`VLLM_API_KEY` as an env var**, not the `--api-key` flag, so the secret doesn't show up in `ps`.\n * A harmless log warning about \"no optimized MoE config for GB10\" is expected; it runs fine on auto-tuned defaults.\n\n\n\nBring it up and smoke-test it (from another machine):\n\n\n HF_TOKEN=... VLLM_API_KEY=... docker compose up -d\n docker compose logs -f # wait for the Uvicorn \"listening\" line\n\n curl http://spark:8000/v1/chat/completions \\\n -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n -H \"Content-Type: application/json\" \\\n -d '{\"model\":\"Qwen/Qwen3.6-35B-A3B-FP8\",\"messages\":[{\"role\":\"user\",\"content\":\"12*17\"}]}'\n\n\n* * *\n\n## Exposing it: nginx reverse proxy on a multihomed server\n\nThe clean topology: a multihomed box (one foot on the internet, one on the intranet) runs nginx, terminates TLS, and forwards inward to the Spark over the LAN. vLLM stays intranet-only and never faces the public internet directly.\n\n### Start HTTP-only, let certbot add TLS\n\nDon't hand-write `ssl_certificate` paths before a cert exists — `nginx -t` will fail on the missing files. 
Deploy an HTTP-only server block first, then let certbot edit it in place.\n\n\n # /etc/nginx/sites-available/inference.yourdomain.com\n upstream vllm_backend {\n server 192.168.0.50:8000; # spark's INTRANET IP (see the .local note)\n keepalive 32;\n }\n\n server {\n listen 80;\n listen [::]:80;\n server_name inference.yourdomain.com;\n\n client_max_body_size 64m; # long prompts exceed the 1m default\n\n location /v1/ {\n proxy_pass http://vllm_backend;\n proxy_http_version 1.1;\n proxy_set_header Connection \"\";\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n proxy_set_header X-Forwarded-Proto $scheme;\n # client's \"Authorization: Bearer <VLLM_API_KEY>\" is forwarded as-is\n\n # CRITICAL for token streaming (SSE):\n proxy_buffering off;\n proxy_cache off;\n proxy_set_header X-Accel-Buffering no;\n\n # LLM generations run long; don't cut them off at 60s:\n proxy_connect_timeout 60s;\n proxy_send_timeout 3600s;\n proxy_read_timeout 3600s;\n }\n\n location = /health {\n proxy_pass http://vllm_backend/health;\n access_log off;\n }\n }\n\n\n\n sudo ln -s /etc/nginx/sites-available/inference.yourdomain.com /etc/nginx/sites-enabled/\n sudo nginx -t && sudo systemctl reload nginx\n sudo certbot --nginx -d inference.yourdomain.com # adds listen 443 ssl, real cert paths, redirect\n\n\ncertbot rewrites this server block: adds `listen 443 ssl;`, fills in the actual `/etc/letsencrypt/live/inference.yourdomain.com/...` paths it creates, adds the HTTP→HTTPS redirect, and installs a renewal timer. Your proxy settings survive.\n\n### The four things that bite you when proxying an LLM\n\n 1. **Streaming.** vLLM streams tokens as Server-Sent Events. nginx's default `proxy_buffering on` holds the whole response until the end — streaming appears broken. `proxy_buffering off` (plus `proxy_cache off`) fixes it.\n 2. 
**Timeouts.** A long generation blows past the 60s default `proxy_read_timeout` and gets chopped mid-stream. Raise it.\n 3. **Body size.** Long prompts exceed the 1 MB `client_max_body_size` default.\n 4. **`.local` resolution.** nginx resolves upstream names at startup via the system resolver, and mDNS `.local` often isn't on that path. Pin the intranet **IP** in the `upstream` block (or add a `/etc/hosts` entry).\n\n\n\n### One more trap: duplicate upstream\n\nAn `upstream` block lives in the global `http{}` context, so its name must be unique across _every_ file nginx loads. After certbot ran, I hit:\n\n\n [emerg] duplicate upstream \"vllm_backend\" in .../inference.yourdomain.com\n\n\nThe cause was two enabled site files both defining `vllm_backend` — a stale placeholder config alongside the real one. The fix: define the upstream in exactly one enabled file. Find them all with `grep -Rn 'upstream vllm_backend' /etc/nginx/`, remove the stale symlink from `sites-enabled`, and reload. (If you genuinely need several server blocks sharing one backend, move the `upstream{}` into its own `conf.d/*.conf` and remove it from the server files.)\n\nAfter the reload, the public endpoint works end to end:\n\n\n curl https://inference.yourdomain.com/v1/models -H \"Authorization: Bearer $VLLM_API_KEY\"\n\n\n* * *\n\n## Lessons, distilled\n\n * On GB10 / SM121, **quantization format trumps capability**: FP8 runs on mature kernels; MXFP4 needs a patched stack and still fights you. 
Choose the model that runs cleanly, not the biggest one.\n * **Unified memory means `--gpu-memory-utilization` starves the host.** Never run the 0.9 default; leave the OS ~20+ GB.\n * **\"Loads\" ≠ \"serves.\"** The first inference is where graph capture, compilation, and GB10 kernel issues actually surface — and graph/compile memory isn't counted in the utilization budget.\n * **Watch `available`, not `free`,** and **monitor/test from a separate machine** so a freeze is observable rather than fatal to your session.\n * **Stock NGC container ≠ community patched build.** Recipes written for one fail on the other; match flags to the build you're actually running.\n * For exposure, **terminate TLS at a multihomed nginx box, keep vLLM intranet-only, go HTTP-first then certbot,** and remember the LLM-proxy specifics: streaming buffering off, long timeouts, larger body size, pinned upstream IP, unique upstream name.\n\n\n\nThe payoff: a TLS-secured, token-authenticated, OpenAI-compatible endpoint backed by a fast local MoE model — running entirely on hardware that fits on a desk.",
"title": "Why Qwen3.6-35B Runs on a NVIDIA DGX Spark and gpt-oss-120B Fought Me Every Step",
"updatedAt": "2026-06-08T13:43:07.054Z"
}