Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifgmjl3xn7d3vny27x6hy5uaip5weqwk6v47orsfej7ywpdhqn44a",
    "uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mn5hxzipacg2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreifzlr6u5fmkxaakfoafatj577ynjkbk2eufk36ci6lgcur7xu3tba"
    },
    "mimeType": "image/jpeg",
    "size": 30336
  },
  "description": "Note on the model name: OpenAI’s open-weight family ships as gpt-oss-20b and gpt-oss-120b. There is no 130B variant — this guide targets gpt-oss-120b, which is the one sized to fit the Spark’s unified memory.\n\nA practical, single-node setup guide for serving gpt-oss-120b as a local coding backend on the GB10 Grace Blackwell DGX Spark, and wiring it into Claude Code.\n\n\n1. Why this model fits the Spark\n\nThe DGX Spark has 128 GB of coherent unified LPDDR5x (~119.7 GB addressable by the GPU) but onl",
  "path": "/running-gpt-oss-120b-on-a-single-nvidia-dgx-spark-a-practical-guide/",
  "publishedAt": "2026-05-31T11:36:03.000Z",
  "site": "https://corti.com",
  "textContent": "> **Note on the model name:** OpenAI’s open-weight family ships as `gpt-oss-20b` and `gpt-oss-120b`. There is no `130B` variant — this guide targets **`gpt-oss-120b`**, which is the one sized to fit the Spark’s unified memory.\n\nA practical, single-node setup guide for serving `gpt-oss-120b` as a local coding backend on the GB10 Grace Blackwell DGX Spark, and wiring it into Claude Code.\n\n* * *\n\n## 1. Why this model fits the Spark\n\nThe DGX Spark has **128 GB of coherent unified LPDDR5x** (~119.7 GB addressable by the GPU) but only **~273 GB/s of memory bandwidth**. Token generation is bandwidth-bound, so bandwidth — not capacity — is the limiting factor.\n\n`gpt-oss-120b` is a good match for two reasons:\n\n  * **It fits.** In its native **MXFP4** weight format the full model loads into the ~120 GB unified pool with room left for KV cache.\n  * **It’s a sparse MoE.** The model has ~117B total parameters but activates only ~5.1B per token. Generation speed scales with _active_ parameters against bandwidth, so it runs far faster than a dense model of comparable footprint.\n\n\n\nFor reference, on the same box a dense ~32B model is bandwidth-starved (~9–10 tok/s), while small-active MoE models run several times faster. Published `gpt-oss-120b` results on the Spark land around **~50 tokens/s** on an optimized engine (SGLang), which is usable for an interactive coding agent.\n\n> **Rule of thumb for the Spark:** prefer MoE models with low active-parameter counts; avoid large dense models.\n\n* * *\n\n## 2. 
Prerequisites\n\nRequirement | Detail\n---|---\nHardware | NVIDIA DGX Spark (GB10), 128 GB unified memory\nOS | DGX OS (Ubuntu-based, ARM64 / `aarch64`)\nGPU stack | CUDA + drivers preinstalled on DGX OS; Blackwell compute capability `sm_121`\nFirmware | Update to a current firmware version before serving (see §6)\nDisk | The 120B weights are large (~60+ GB on disk); the 4 TB NVMe is fine, but watch free space if you keep multiple quants\nAccess | A Hugging Face account + access token for `openai/gpt-oss-120b`\n\nSet your token once:\n\n\n    export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxxx\"\n\n\n* * *\n\n## 3. Pick an inference engine\n\nThree viable paths, from easiest to highest-throughput. **All three serve an HTTP API** you can point a client at.\n\nEngine | Effort | API exposed | Best for\n---|---|---|---\n**Ollama** | Lowest | OpenAI-compatible | Quick start, single user\n**llama.cpp** | Medium | OpenAI-compatible | Control, tuning, GGUF quants\n**SGLang** | Higher | OpenAI-compatible (+ Anthropic-compatible via proxy) | Best measured throughput on Spark\n\n> Community testing on the Spark consistently recommends **llama.cpp or SGLang over Ollama** for throughput on this hardware. Use Ollama to confirm everything works, then move to llama.cpp/SGLang for daily use.\n\n* * *\n\n## 4. Option A — Ollama (fastest to first token)\n\n\n    # Pull and run; Ollama fetches the official MXFP4 build\n    ollama pull gpt-oss:120b\n    ollama run gpt-oss:120b\n\n\nOllama exposes an OpenAI-compatible endpoint at `http://localhost:11434/v1`.\n\nCaveats:\n\n  * Ollama defaults to a **4096-token context**. Raise it for real coding work (see model/Modelfile context settings).\n  * Performance is acceptable for testing but typically below a tuned llama.cpp/SGLang setup.\n\n\n\n* * *\n\n## 5. 
Option B — llama.cpp (recommended for control)\n\nBuild llama.cpp with CUDA support for the Blackwell GPU, then serve a GGUF build of the model.\n\n\n    # Flags: -c = context length (tune to your workload, see notes); -ngl 999 = offload all layers to the Blackwell GPU;\n    # --flash-attn on = enable flash attention; --no-mmap = see mmap note below; --kv-unified = single shared KV buffer;\n    # --jinja = use the model's chat template; -ub 2048 = micro-batch size for prompt processing.\n    ~/llama.cpp/build/bin/llama-server \\\n      -m ~/.cache/llama.cpp/gpt-oss-120b/gpt-oss-120b.gguf \\\n      -c 16384 \\\n      -ngl 999 \\\n      --flash-attn on \\\n      --no-mmap \\\n      --kv-unified \\\n      --jinja \\\n      -ub 2048 \\\n      --host 0.0.0.0 \\\n      --port 8005\n\n\n**Flag rationale:**\n\n  * `-ngl 999` — force all layers onto the GPU. On unified memory this keeps everything in the fast path.\n  * `--no-mmap` — there is a **known mmap issue on the Spark** that inflates model load time (reported ~5×). Disabling mmap fixes load times.\n  * `--flash-attn on` — standard attention speedup for transformer inference.\n  * `-c` (context) — **directly trades off against memory and speed.** Larger context grows the KV cache and reduces tok/s. On a comparable small-active MoE, throughput dropped from ~20–25 tok/s at 16K context to ~15–17 tok/s at 32K. Start at 16K and only raise it if your task needs it.\n  * `-ub 2048` — larger micro-batch improves prompt-processing (prefill) throughput.\n\n\n\nEndpoint: `http://<spark-ip>:8005/v1` (OpenAI-compatible).\n\n* * *\n\n## 6. 
Option C — SGLang (highest measured throughput)\n\nSGLang has explicit DGX Spark support and produced the best published `gpt-oss-120b` numbers (~50 tok/s).\n\nGeneral shape (consult the current SGLang DGX Spark docs for exact flags/container):\n\n\n    # Launch the SGLang server pointing at the 120B weights\n    python -m sglang.launch_server \\\n      --model-path openai/gpt-oss-120b \\\n      --host 0.0.0.0 \\\n      --port 30000\n\n\nNotes:\n\n  * The 120B is ~6× the size of the 20B build, so **expect longer load times**.\n  * For stability on the larger model, **enabling swap memory** on the Spark is recommended.\n  * Endpoint: `http://<spark-ip>:30000/v1`.\n\n\n\n> **Firmware:** keep DGX OS current before serving, either via the DGX Dashboard or from the CLI.\n\n* * *\n\n## 7. Verify the server\n\nOpenAI-compatible smoke test against whichever engine you started (the example targets the llama.cpp port `8005`; use `11434` for Ollama or `30000` for SGLang, and the model name your engine registered):\n\n\n    curl http://localhost:8005/v1/chat/completions \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\n        \"model\": \"gpt-oss-120b\",\n        \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python function that returns the nth Fibonacci number.\"}],\n        \"max_tokens\": 256\n      }'\n\n\nA coherent code response confirms the model is loaded and serving.\n\n* * *\n\n## 8. Wire it into Claude Code\n\nClaude Code speaks the **Anthropic `/v1/messages` API**, while llama.cpp/Ollama/SGLang expose an **OpenAI-compatible** API. You therefore need one of:\n\n  * **(a) An Anthropic-compatible endpoint**, exposed directly by the engine or via a bridge, **or**\n  * **(b) A translation gateway** (e.g. **LiteLLM**) that accepts Anthropic-format requests and forwards them to your OpenAI-compatible server.\n\n\n\nYou point Claude Code at an endpoint by setting the `ANTHROPIC_BASE_URL` environment variable (the official mechanism for routing through a custom endpoint).\n\n### 8a. 
Direct / bridged endpoint\n\nIf your server (or a thin bridge in front of it) presents an Anthropic-shaped `/v1/messages` endpoint:\n\n\n    ANTHROPIC_BASE_URL=http://localhost:8005 \\\n    ANTHROPIC_AUTH_TOKEN=dummy \\\n    ANTHROPIC_DEFAULT_OPUS_MODEL=gpt-oss-120b \\\n    ANTHROPIC_DEFAULT_SONNET_MODEL=gpt-oss-120b \\\n    ANTHROPIC_DEFAULT_HAIKU_MODEL=gpt-oss-120b \\\n    claude\n\n\n  * `ANTHROPIC_AUTH_TOKEN` carries the bearer/gateway token (`dummy` works for an open local server that ignores auth).\n  * The `ANTHROPIC_DEFAULT_*_MODEL` variables map Claude Code’s Opus/Sonnet/Haiku tiers onto your single local model, so every tier resolves to `gpt-oss-120b`.\n\n\n\n### 8b. LiteLLM bridge (for OpenAI-only servers)\n\nRun LiteLLM in front of llama.cpp/Ollama, register the model under `claude-*` aliases, then point Claude Code at LiteLLM’s URL with the same env vars as above. This is the established pattern for using a purely OpenAI-compatible local server with Claude Code on the Spark.\n\n### Persisting and a caching gotcha\n\nAdd the variables to `~/.bashrc`/`~/.zshrc`, or to `~/.claude/settings.json` under an `env` block.\n\n**Prefix-caching note:** Claude Code injects a per-request attribution hash into the system prompt, which can defeat prefix caching and slow throughput. If your serving stack doesn’t handle this automatically, set:\n\n\n    {\n      \"env\": { \"CLAUDE_CODE_ATTRIBUTION_HEADER\": \"0\" }\n    }\n\n\nin `~/.claude/settings.json`.\n\nLaunch Claude Code and run a small prompt to confirm requests are routing to the Spark.\n\n* * *\n\n## 9. Tuning checklist\n\n  * **Context length is your main lever.** Bigger context = bigger KV cache = lower tok/s and more memory. 
Right-size it per task (16K is a sane default; raise deliberately).\n  * **Stay on MoE.** Don’t swap in dense models on this box expecting similar speed.\n  * **`--no-mmap`** on llama.cpp to avoid the slow-load bug.\n  * **Enable swap** for stability when loading the 120B.\n  * **One engine, one quant.** Multiple large GGUF/quant copies fill the NVMe fast.\n  * **Watch active-vs-total params**, not total size, when predicting speed.\n\n\n\n* * *\n\n## 10. Honest expectations vs. “like Opus”\n\nOn a _single_ Spark, `gpt-oss-120b` is the largest coherent, frontier-style reasoning/tool-use model that fits, and it is genuinely usable in a Claude Code loop at ~50 tok/s. It is **not** equivalent to a current frontier closed model. The open models that most directly rival top closed models on agentic coding are trillion-parameter MoEs (e.g. Kimi K2.x, DeepSeek V4-Pro, large GLM MoEs) — those do **not** fit on one Spark and would require clustering two Sparks over the ConnectX-7 200G link or different hardware.\n\nIf you want a _coding-specialized_ alternative on the same box, Qwen3-Coder variants (e.g. 
30B-A3B, or Qwen3-Coder-Next in FP8/NVFP4) are smaller-active MoEs that run faster and are widely used with Claude Code on the Spark.\n\n* * *\n\n### Source anchors\n\n  * DGX Spark hardware (GB10, 128 GB unified, 273 GB/s, `sm_121`, DGX OS): NVIDIA / LMSYS / StorageReview reviews.\n  * `gpt-oss-120b` on Spark (~50 tok/s, SGLang support, fits 120 GB, swap recommendation): LMSYS DGX Spark + GPT-OSS posts, Ollama Spark performance blog.\n  * llama.cpp flags and the `--no-mmap` load-time bug, context-vs-throughput figures: community Spark engine write-ups.\n  * Dense-vs-MoE throughput contrast and “use llama.cpp / switch to MoE” guidance: NVIDIA developer forum.\n  * Claude Code routing (`ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_DEFAULT_*_MODEL`, `CLAUDE_CODE_ATTRIBUTION_HEADER`): Claude Code authentication docs, vLLM Claude Code integration docs, LiteLLM bridge example.\n\n",
  "title": "Running GPT-OSS-120B on a Single NVIDIA DGX Spark - A Practical Guide",
  "updatedAt": "2026-05-31T11:36:04.385Z"
}