Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigadnc2jopedfap4qanr725aeps3363w7jhlxgstokqfphqylpb4e",
    "uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mnrssk2dbu22"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreidlfxj2opwpkrbvpnvkfglsppttvxwszitwjmuyyrn2mg2ko3jr6u"
    },
    "mimeType": "image/png",
    "size": 601296
  },
  "description": "A field report from getting a local LLM inference endpoint working on an NVIDIA DGX Spark (GB10 / SM121, 128 GB unified memory) — including every wall I hit with gpt-oss-120B, why a smaller FP8 model sidestepped all of them, and how to expose the result safely through an nginx reverse proxy on a multihomed server.\n\nTL;DR: On a GB10 Spark, the quantization format matters more than raw capability. gpt-oss-120B ships in MXFP4, which has no native hardware support on SM121 and runs through fragile s",
  "path": "/why-qwen3-6-35b-runs-on-a-nvidia-dgx-spark-and-gpt-oss-120b-fought-me-every-step/",
  "publishedAt": "2026-06-08T13:43:06.000Z",
  "site": "https://corti.com",
  "textContent": "A field report from getting a local LLM inference endpoint working on an NVIDIA DGX Spark (GB10 / SM121, 128 GB unified memory) — including every wall I hit with gpt-oss-120B, why a smaller FP8 model sidestepped all of them, and how to expose the result safely through an nginx reverse proxy on a multihomed server.\n\n**TL;DR:** On a GB10 Spark, the quantization format matters more than raw capability. gpt-oss-120B ships in MXFP4, which has no native hardware support on SM121 and runs through fragile software kernel paths; combined with the Spark's unified memory, that produced a cascade of freezes and crashes. Qwen3.6-35B-A3B in FP8 — smaller, mixture-of-experts, and on a well-supported kernel path — loaded and served cleanly on the first honest attempt.\n\n* * *\n\n## The hardware, and the two traps it sets\n\nThe DGX Spark is a GB10 Grace Blackwell machine with 128 GB of **unified** memory shared between CPU and GPU. Two architectural facts shaped everything that followed:\n\n  1. **Unified memory is shared.** vLLM's `--gpu-memory-utilization` is a fraction of the  _entire_ 128 GB pool, not a separate VRAM budget. The default is `0.9`. On a discrete GPU that only touches VRAM; here it starves the host OS.\n  2. **SM121 has no native FP4.** Blackwell-class GB10 runs FP4 weights through software decompression kernels (Marlin/CUTLASS paths). For MXFP4 models like gpt-oss, those paths are immature and version-sensitive.\n\n\n\nNeither is obvious until you trip over it. I tripped over both.\n\n* * *\n\n## The gpt-oss-120B saga\n\n### Wall 1 — the host froze at the default memory setting\n\nThe first bare `vllm serve openai/gpt-oss-120b` reserved ~90% of the unified pool (~115 GB), leaving the kernel, Docker, and sshd to fight over the remaining ~13 GB. The box stopped responding to SSH while still answering ping — classic memory starvation, not a crash. 
The fix is to leave the host real headroom: `--gpu-memory-utilization 0.70` (about 30% of the pool, ~38 GB, left outside vLLM's budget). On unified memory you _never_ run the 0.9 default.\n\n### Wall 2 — \"it loaded\" is not \"it serves\"\n\nWith memory tamed, the model loaded and idled happily at ~74 GB used. Then the first inference request wedged the entire host. Loading and serving are different phases with different failure modes, and the first decode is where the GB10-specific kernel problems actually bite.\n\n### Wall 3 — the MXFP4-on-SM121 problem (the real one)\n\nThis is the crux. gpt-oss-120B's weights are MXFP4, and on SM121 vLLM's default backend selection lands on a kernel path that hangs or crashes on first decode. The community has converged on workarounds, but they're entangled with a specific _patched_ build of vLLM + FlashInfer. On the stock NVIDIA NGC container, those workarounds don't all apply, which produced a string of secondary failures:\n\n  * `unrecognized arguments: --mxfp4-layers` — that flag exists only in the patched build; stock vLLM 0.21.0 rejects it.\n  * `FLASHINFER ... attention sinks not supported` — gpt-oss uses attention sinks, and the stock container's FlashInfer can't do them, so forcing that backend aborted load. (The patched build compiles its own FlashInfer that can.)\n  * `Unknown vLLM environment variable: VLLM_MXFP4_BACKEND` — the marlin-backend env var simply isn't read by this build.\n\n\n\nEach \"fix\" from a recipe written against the patched build was a flag the stock container didn't understand.\n\n### Wall 4 — the memory spike the budget doesn't count\n\nOnce the flags were stripped back to what stock vLLM accepts, the model loaded and idled at ~74 GB with ~46 GB free — stable. 
Then the first request did this:\n\n\n    18:55:56   used 76.3G   avail 45.4G\n    18:55:58   used 78.0G   avail 43.7G\n    18:56:00   used 89.8G   avail 31.9G\n    18:56:02   used 110.3G  avail 11.4G\n    18:56:04   used 121.3G  avail  0.4G   <== host starved\n\n\nA ~47 GB spike on top of the resident model, in six seconds. The cause: **CUDA graph capture plus torch.compile firing on the first forward pass** — and that memory is _not_ counted against `--gpu-memory-utilization`. So 0.70 left 46 GB headroom, the spike wanted more, and the host died. The lever that kills it is `--enforce-eager`, which disables graph capture and compilation (at a real throughput cost). That's the trade I'd make to get 120B stable on the stock container — but by this point the smarter move was a different model.\n\n### A debugging aside: watch `available`, not `free`, and watch from elsewhere\n\nTwo habits saved time. First, the memory metric that matters on Linux is **available**, not **free** — during heavy file reads `free` drops toward zero as the page cache fills, while `available` (which counts reclaimable cache) stays healthy. Misreading `free` as \"almost out of memory\" sends you chasing ghosts. Second, **run your monitoring and test client from a different machine**. I was curling the endpoint over SSH _on the Spark itself_, so when it froze, my client and my shell died with it. A laptop-side memory watcher that streams `free`/avail and reconnects on drop turns \"it froze\" into a timestamped, observable event.\n\n* * *\n\n## Why Qwen3.6-35B-A3B-FP8 just worked\n\nSwitching to Qwen3.6-35B-A3B-FP8 removed every one of those failure classes at once, for three structural reasons:\n\n  * **FP8, not MXFP4.** FP8 runs on a well-supported kernel path on SM121; vLLM auto-selects a working MoE backend and the model just loads. 
None of the Marlin/CUTLASS/FlashInfer-sinks drama applies.\n  * **It fits with room to spare.** At ~35 GB of weights against 121 GB, even the CUDA-graph capture spike fits inside the headroom — so there's no first-inference freeze, and you don't even need `--enforce-eager`.\n  * **It's a fast MoE.** 35B total but only ~3B active parameters per token, so on the bandwidth-bound Spark it decodes quickly for its quality. Benchmarks on Spark report roughly 28–30 tok/s single-stream, scaling to ~150+ tok/s aggregate under concurrency.\n\n[Figure: Qwen/Qwen3.6-35B-A3B-FP8 reasoning about a binary sorting algorithm]\n\nThe lesson generalizes: on GB10, prefer FP8 (or a quantization with a mature SM121 kernel) over MXFP4, and prefer a model that fits comfortably over one that maxes the unified pool. A 35B FP8 MoE is a far better daily driver here than a 120B MXFP4 model that needs a patched stack and an eager-mode throughput penalty just to stay upright.\n\n[Figure: NVIDIA DGX Dashboard for Spark while inferencing]\n\n* * *\n\n## The working setup\n\n### `.env` — secrets, kept out of the compose file\n\nTwo secrets: the Hugging Face token (a **read** token is enough — you're only downloading) and the vLLM API key (the bearer token clients must present). Keep them in a `.env` beside the compose file; Docker Compose auto-loads it for `${VAR}` substitution.\n\n\n    cd ~/docker/vllm/qwen36\n    cat > .env <<'EOF'\n    HF_TOKEN=hf_your_read_token\n    VLLM_API_KEY=sk-replace-with-a-strong-key\n    EOF\n    chmod 600 .env\n    echo '.env' >> .gitignore        # never commit it\n\n\nGenerate a strong API key with `echo \"sk-$(openssl rand -hex 32)\"`.\n\n### Fetch the model up front, then share it into the container\n\nDon't let the first `vllm serve` do a multi-gigabyte download as part of startup — stage it once, then mount the cache into the container (\"download once, mount everywhere\"). 
On a managed Ubuntu/DGX OS box, install the CLI in isolation (system Python is externally managed):\n\n\n    sudo apt install -y pipx && pipx ensurepath\n    pipx install \"huggingface_hub[cli]\"\n    pipx inject huggingface_hub hf_transfer       # faster large downloads\n    export HF_HUB_ENABLE_HF_TRANSFER=1\n\n    hf auth login                                  # paste the read token\n    hf download Qwen/Qwen3.6-35B-A3B-FP8           # lands in ~/.cache/huggingface\n\n\nRun large downloads inside `tmux` so they survive a dropped SSH session (downloads are resumable — re-running `hf download` continues where it left off).\n\nThe sharing mechanism is a single volume mount: bind the host cache to the container's cache path. vLLM then finds the weights locally and starts fast, with no network fetch at serve time:\n\n\n        volumes:\n          - ~/.cache/huggingface:/root/.cache/huggingface\n\n\n### `compose.yml`\n\n\n    services:\n      vllm:\n        image: nvcr.io/nvidia/vllm:26.05.post1-py3\n        container_name: vllm-qwen36\n        gpus: all\n        network_mode: host\n        ipc: host\n        shm_size: \"16gb\"\n        environment:\n          - HF_TOKEN=${HF_TOKEN}\n          - VLLM_API_KEY=${VLLM_API_KEY}        # bearer token clients must send\n        volumes:\n          - ~/.cache/huggingface:/root/.cache/huggingface\n        command: >\n          vllm serve Qwen/Qwen3.6-35B-A3B-FP8\n          --host 0.0.0.0\n          --port 8000\n          --tensor-parallel-size 1\n          --gpu-memory-utilization 0.70\n          --max-model-len 32768\n          --kv-cache-dtype fp8\n          --max-num-batched-tokens 8192\n          --enable-prefix-caching\n          --trust-remote-code\n          --enable-auto-tool-choice\n          --tool-call-parser qwen3_coder\n          --reasoning-parser qwen3\n        restart: \"no\"        # flip to unless-stopped once you trust it\n\n\nNotes that matter:\n\n  * **No `--quantization` flag** — FP8 is auto-detected from 
the repo. (Passing MXFP4-specific flags here is what broke gpt-oss on stock vLLM.)\n  * **No `--enforce-eager`** — the model is small enough that CUDA graphs fit, so you keep full speed. Only add it back if memory climbs on first inference.\n  * **`VLLM_API_KEY` as an env var**, not the `--api-key` flag, so the secret doesn't show up in `ps`.\n  * A harmless log warning about \"no optimized MoE config for GB10\" is expected; it runs fine on auto-tuned defaults.\n\n\n\nBring it up and smoke-test it (from another machine):\n\n\n    HF_TOKEN=... VLLM_API_KEY=... docker compose up -d\n    docker compose logs -f                # wait for the Uvicorn \"listening\" line\n\n    curl http://spark:8000/v1/chat/completions \\\n      -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\"model\":\"Qwen/Qwen3.6-35B-A3B-FP8\",\"messages\":[{\"role\":\"user\",\"content\":\"12*17\"}]}'\n\n\n* * *\n\n## Exposing it: nginx reverse proxy on a multihomed server\n\nThe clean topology: a multihomed box (one foot on the internet, one on the intranet) runs nginx, terminates TLS, and forwards inward to the Spark over the LAN. vLLM stays intranet-only and never faces the public internet directly.\n\n### Start HTTP-only, let certbot add TLS\n\nDon't hand-write `ssl_certificate` paths before a cert exists — `nginx -t` will fail on the missing files. 
Deploy an HTTP-only server block first, then let certbot edit it in place.\n\n\n    # /etc/nginx/sites-available/inference.yourdomain.com\n    upstream vllm_backend {\n        server 192.168.0.50:8000;     # spark's INTRANET IP (see the .local note)\n        keepalive 32;\n    }\n\n    server {\n        listen 80;\n        listen [::]:80;\n        server_name inference.yourdomain.com;\n\n        client_max_body_size 64m;     # long prompts exceed the 1m default\n\n        location /v1/ {\n            proxy_pass http://vllm_backend;\n            proxy_http_version 1.1;\n            proxy_set_header Connection \"\";\n            proxy_set_header Host $host;\n            proxy_set_header X-Real-IP $remote_addr;\n            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n            proxy_set_header X-Forwarded-Proto $scheme;\n            # client's \"Authorization: Bearer <VLLM_API_KEY>\" is forwarded as-is\n\n            # CRITICAL for token streaming (SSE):\n            proxy_buffering off;\n            proxy_cache off;\n            proxy_set_header X-Accel-Buffering no;\n\n            # LLM generations run long; don't cut them off at 60s:\n            proxy_connect_timeout 60s;\n            proxy_send_timeout    3600s;\n            proxy_read_timeout    3600s;\n        }\n\n        location = /health {\n            proxy_pass http://vllm_backend/health;\n            access_log off;\n        }\n    }\n\n\n\n    sudo ln -s /etc/nginx/sites-available/inference.yourdomain.com /etc/nginx/sites-enabled/\n    sudo nginx -t && sudo systemctl reload nginx\n    sudo certbot --nginx -d inference.yourdomain.com     # adds listen 443 ssl, real cert paths, redirect\n\n\ncertbot rewrites this server block: adds `listen 443 ssl;`, fills in the actual `/etc/letsencrypt/live/inference.yourdomain.com/...` paths it creates, adds the HTTP→HTTPS redirect, and installs a renewal timer. 
Your proxy settings survive.\n\n### The four things that bite you when proxying an LLM\n\n  1. **Streaming.** vLLM streams tokens as Server-Sent Events. nginx's default `proxy_buffering on` buffers the upstream response, so tokens reach the client in bursts or all at once and streaming appears broken. `proxy_buffering off` (plus `proxy_cache off`) fixes it.\n  2. **Timeouts.** A long generation blows past the 60s default `proxy_read_timeout` and gets chopped mid-stream. Raise it.\n  3. **Body size.** Long prompts exceed the 1 MB `client_max_body_size` default.\n  4. **`.local` resolution.** nginx resolves upstream names at startup via the system resolver, and mDNS `.local` often isn't on that path. Pin the intranet **IP** in the `upstream` block (or add a `/etc/hosts` entry).\n\n\n\n### One more trap: duplicate upstream\n\nAn `upstream` block lives in the global `http{}` context, so its name must be unique across _every_ file nginx loads. After certbot ran, I hit:\n\n\n    [emerg] duplicate upstream \"vllm_backend\" in .../inference.yourdomain.com\n\n\nThe cause was two enabled site files both defining `vllm_backend` — a stale placeholder config alongside the real one. The fix: define the upstream in exactly one enabled file. Find them all with `grep -Rn 'upstream vllm_backend' /etc/nginx/`, remove the stale symlink from `sites-enabled`, and reload. (If you genuinely need several server blocks sharing one backend, move the `upstream{}` into its own `conf.d/*.conf` and remove it from the server files.)\n\nAfter the reload, the public endpoint works end to end:\n\n\n    curl https://inference.yourdomain.com/v1/models -H \"Authorization: Bearer $VLLM_API_KEY\"\n\n\n* * *\n\n## Lessons, distilled\n\n  * On GB10 / SM121, **quantization format trumps capability**: FP8 runs on mature kernels; MXFP4 needs a patched stack and still fights you. 
Choose the model that runs cleanly, not the biggest one.\n  * **Unified memory means `--gpu-memory-utilization` starves the host.** Never run the 0.9 default; leave the OS ~20+ GB.\n  * **\"Loads\" ≠ \"serves.\"** The first inference is where graph capture, compilation, and GB10 kernel issues actually surface — and graph/compile memory isn't counted in the utilization budget.\n  * **Watch `available`, not `free`,** and **monitor/test from a separate machine** so a freeze is observable rather than fatal to your session.\n  * **Stock NGC container ≠ community patched build.** Recipes written for one fail on the other; match flags to the build you're actually running.\n  * For exposure, **terminate TLS at a multihomed nginx box, keep vLLM intranet-only, go HTTP-first then certbot,** and remember the LLM-proxy specifics: streaming buffering off, long timeouts, larger body size, pinned upstream IP, unique upstream name.\n\n\n\nThe payoff: a TLS-secured, token-authenticated, OpenAI-compatible endpoint backed by a fast local MoE model — running entirely on hardware that fits on a desk.",
  "title": "Why Qwen3.6-35B Runs on a NVIDIA DGX Spark and gpt-oss-120B Fought Me Every Step",
  "updatedAt": "2026-06-08T13:43:07.054Z"
}