Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicl7l4adt7yxy4io6kzhddvnlache67tlxcmadaytfjamajsu4rca",
    "uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3momv232y3iy2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigbbbt44nuqlw3jqt5mgwgbj7aktmqkzj4pxt6y7shyx6a4dritsq"
    },
    "mimeType": "image/png",
    "size": 543637
  },
  "description": "Coding agents like Claude Code and Codex are excellent, but both are wired to a specific vendor's API. If you run your own inference stack — for cost control, data residency, or because you have GPUs sitting idle — you want an agent you can point at your endpoint. OpenCode is the cleanest fit: it's terminal-first, open source, and talks to any OpenAI-compatible API without a translation layer.\n\nThis post walks through connecting OpenCode's CLI to a self-hosted vLLM server, using NVIDIA's Nemotro",
  "path": "/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super/",
  "publishedAt": "2026-06-19T08:04:59.000Z",
  "site": "https://corti.com",
  "tags": [
    "OpenCode",
    "vLLM",
    "I'm currently hosting on my 2 NVIDIA DGX Spark node cluster",
    "HF model card",
    "vLLM recipes",
    "best practices",
    "nvtop",
    "@ai-sdk",
    "@-"
  ],
  "textContent": "Coding agents like Claude Code and Codex are excellent, but both are wired to a specific vendor's API. If you run your own inference stack — for cost control, data residency, or because you have GPUs sitting idle — you want an agent you can point at  _your_ endpoint. OpenCode is the cleanest fit: it's terminal-first, open source, and talks to any OpenAI-compatible API without a translation layer.\n\nThis post walks through connecting OpenCode's CLI to a self-hosted vLLM server, using NVIDIA's `Nemotron-3-Super-120B-A12B` as the worked example.\n\nThis is the model I'm currently hosting on my 2 NVIDIA DGX Spark node cluster.\n\nThe model choice matters: it's a  _reasoning_ model with a hybrid Mamba/MoE architecture, which surfaces a few gotchas that a vanilla chat model wouldn't.\n\nEverything here generalizes to any OpenAI-compatible endpoint — substitute your own model and host.\n\n## The one thing that decides everything: API shape\n\nThere are two API \"shapes\" in the coding-agent world:\n\n  * **OpenAI Chat Completions** (`POST /v1/chat/completions`) — what vLLM, Ollama, LM Studio, and most self-hosted runtimes speak.\n  * **Anthropic Messages** (`POST /v1/messages`) — what Claude Code speaks.\n\n\n\nThis is the whole ballgame. **Claude Code cannot talk to a vLLM endpoint directly** — it needs a translation proxy (e.g. LiteLLM) that accepts Anthropic requests and re-emits them as OpenAI. **OpenCode speaks OpenAI natively** , so there's no proxy: you add a provider block and you're done. 
That single fact is why OpenCode is the lower-friction choice for a self-hosted setup.\n\n## Prerequisites\n\n  * A vLLM server exposing an OpenAI-compatible endpoint **with tool calling enabled** (the agent loop is dead without it).\n  * The OpenCode CLI installed (`brew install opencode`, `npm i -g opencode`, or the install script from opencode.ai).\n  * `curl` and `jq` for validation.\n\n\n\n## Step 1 — Serve the model with the  _right_ parsers\n\nFor agentic coding, two server-side parsers do the heavy lifting:\n\n  * A **tool-call parser** that extracts structured `tool_calls` from the model's raw output.\n  * A **reasoning parser** that separates chain-of-thought from the user-facing answer (only relevant for reasoning models).\n\n\n\nGet either wrong and the agent breaks in confusing ways — reasoning text leaks into tool arguments, or tool calls never get parsed at all.\n\nFor Nemotron 3 Super, NVIDIA specifies the `qwen3_coder` tool parser (yes, even though this isn't a Qwen model) and a `super_v3` / `nemotron_v3` reasoning parser. A representative single-node serve command:\n\n\n    vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \\\n      --served-model-name nvidia/nemotron-3-super \\\n      --host 0.0.0.0 --port 8000 \\\n      --trust-remote-code \\\n      --kv-cache-dtype fp8 \\\n      --max-model-len 262144 \\\n      --gpu-memory-utilization 0.85 \\\n      --enable-chunked-prefill \\\n      --enable-auto-tool-choice \\\n      --tool-call-parser qwen3_coder \\\n      --reasoning-parser nemotron_v3\n\n\n> **Authoritative flags live on the model card.** Tensor-parallel size, quantization, MoE backend, and the exact reasoning-parser invocation are model- and hardware-specific. For Nemotron the HF model card and vLLM recipes are the source of truth. 
Don't copy a serve command from a blog (including this one) without checking it against the card for your checkpoint and GPU.\n\nA note on **quantization**: pre-quantized NVFP4/FP8 checkpoints carry their own quant config, and vLLM auto-detects it. Forcing `--quantization fp4` is at best redundant and at worst selects a different kernel path — prefer auto-detection unless the card tells you otherwise.\n\n## Step 2 — Store the credential\n\nIf your server enforces an API key (vLLM does this when `VLLM_API_KEY` is set in its environment), OpenCode needs that key. Store it without putting it in a config file:\n\n\n    opencode auth login\n    # → scroll to \"Other\"\n    # → provider ID: pulsar        (you'll reuse this exact ID in config)\n    # → paste your API key\n\n\nThis writes only the credential to `~/.local/share/opencode/auth.json`. You still have to add the provider block in Step 3.\n\n## Step 3 — Add the provider block\n\nEdit `~/.config/opencode/opencode.json` (global) or a project-local `opencode.json`:\n\n\n    {\n      \"$schema\": \"https://opencode.ai/config.json\",\n      \"provider\": {\n        \"pulsar\": {\n          \"npm\": \"@ai-sdk/openai-compatible\",\n          \"name\": \"Self-Hosted vLLM\",\n          \"options\": {\n            \"baseURL\": \"https://llm.example.internal/v1\",\n            \"apiKey\": \"{env:VLLM_API_KEY}\"\n          },\n          \"models\": {\n            \"nvidia/nemotron-3-super\": {\n              \"name\": \"Nemotron-3-Super-120B\",\n              \"limit\": { \"context\": 262144, \"output\": 32768 }\n            }\n          }\n        }\n      },\n      \"model\": \"pulsar/nvidia/nemotron-3-super\"\n    }\n\n\nField-by-field:\n\n  * `**npm: \"@ai-sdk/openai-compatible\"**` — the adapter for any `/v1/chat/completions` endpoint. If a model is served via `/v1/responses` instead, use `@ai-sdk/openai`.\n  * `**options.baseURL**` — ends at `/v1`, **not** the full `/v1/chat/completions` path. 
The adapter appends the rest.\n  * `**options.apiKey**` — `{env:VAR}` reads from the environment at launch; `{file:~/.secrets/key}` reads from a file. Either beats a hardcoded literal. (If you used `opencode auth login`, you can omit this.)\n  * **`models` keys** — must match **exactly** what your server returns as the model ID, i.e. your `--served-model-name`. Verify with the `/v1/models` call below. OpenCode tolerates `/` in model IDs, so `nvidia/nemotron-3-super` works as a key — a case Claude Code can't handle.\n  * **`limit.context`** — see the best practices; do **not** blindly set this to your `--max-model-len`.\n  * `**model**` — sets the default; the runtime form is `providerID/modelID`, so with a slashed model ID you get the double slash `pulsar/nvidia/nemotron-3-super`.\n\n\n\n## Step 4 — Validate the endpoint before trusting it\n\nWire-checking the endpoint by hand saves you from debugging \"why is my agent weird\" later. Do it in four escalating steps.\n\n### 4a. Can I even reach the model list?\n\n\n    curl -s https://llm.example.internal/v1/models \\\n      -H \"Authorization: Bearer $VLLM_API_KEY\" | jq '.data[].id'\n\n\nThis should print your served model ID. If you get:\n\n\n    jq: error (at <stdin>:0): Cannot iterate over null (null)\n\n\n…that is **not** a model problem. It means the endpoint returned valid JSON with no `data` field — almost always a `{\"error\": ...}` body from a **401**, because the `Authorization` header was missing or wrong. (If the body were unparseable HTML you'd get a _parse_ error instead.) Add the header. To prove it's the server and not your reverse proxy, hit the node directly, bypassing TLS/nginx:\n\n\n    curl -s http://localhost:8000/v1/models -H \"Authorization: Bearer $VLLM_API_KEY\" | jq .\n\n\n### 4b. One-shot tool-call smoke test\n\nA model that lists fine can still emit malformed tool calls. This test sends a trivial `get_weather` tool and a prompt that explicitly asks for a call. 
Point it at your  _public_ endpoint (not localhost) so it also exercises your reverse proxy's handling of POST bodies — the exact path the agent will use.\n\n\n    curl -s https://llm.example.internal/v1/chat/completions \\\n      -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n      -H \"Content-Type: application/json\" \\\n      -d @- <<'JSON' | jq .\n    {\n      \"model\": \"nvidia/nemotron-3-super\",\n      \"temperature\": 1.0,\n      \"top_p\": 0.95,\n      \"max_tokens\": 1024,\n      \"tool_choice\": \"auto\",\n      \"messages\": [\n        {\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool to find out.\"}\n      ],\n      \"tools\": [\n        {\n          \"type\": \"function\",\n          \"function\": {\n            \"name\": \"get_weather\",\n            \"description\": \"Get the current weather for a city.\",\n            \"parameters\": {\n              \"type\": \"object\",\n              \"properties\": {\n                \"location\": {\"type\": \"string\", \"description\": \"City name, e.g. Zurich\"},\n                \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n              },\n              \"required\": [\"location\"]\n            }\n          }\n        }\n      ]\n    }\n    JSON\n\n\n> Sampling is set to NVIDIA's recommended `temperature 1.0 / top_p 0.95`, which Nemotron's card prescribes for  _all_ tasks — reasoning, tool calling, and chat alike. 
Test under the same conditions your agent will run.\n\n**What a healthy response looks like:**\n\n\n    {\n      \"choices\": [\n        {\n          \"message\": {\n            \"role\": \"assistant\",\n            \"content\": null,\n            \"tool_calls\": [\n              {\n                \"id\": \"chatcmpl-tool-...\",\n                \"type\": \"function\",\n                \"function\": {\n                  \"name\": \"get_weather\",\n                  \"arguments\": \"{\\\"location\\\": \\\"Zurich\\\"}\"\n                }\n              }\n            ],\n            \"reasoning\": \"I need to get the current weather in Zurich...\"\n          },\n          \"finish_reason\": \"tool_calls\"\n        }\n      ],\n      \"system_fingerprint\": \"vllm-0.21.0+...-tp2-...\"\n    }\n\n\nThree things to read off this:\n\n  1. `finish_reason: \"tool_calls\"` and a well-formed `tool_calls[0]`.\n  2. `content: null` with the chain-of-thought isolated in a separate `reasoning` field. **This is the success signal for a reasoning model** — it proves the reasoning parser kept the thinking out of `content` and out of the tool arguments. When that separation fails, reasoning text contaminates the arguments and the agent loop breaks.\n  3. A `tp2` (or similar) tag in `system_fingerprint` confirms your tensor-parallel topology is actually live — useful when you're serving across a multi-node cluster and want to be sure it didn't silently fall back to one node.\n\n\n\n### 4c. Pass/fail in one line\n\nThe check that actually matters is that `function.arguments` is a **parseable JSON string** — malformed arguments are the classic tool-parser failure. 
The `fromjson` step below throws (→ FAIL) if they aren't valid JSON:\n\n\n    curl -s https://llm.example.internal/v1/chat/completions \\\n      -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n      -H \"Content-Type: application/json\" \\\n      -d @- <<'JSON' | jq -e '\n        .choices[0] as $c\n        | ($c.finish_reason == \"tool_calls\")\n          and ($c.message.tool_calls | type == \"array\")\n          and ($c.message.tool_calls[0].function.name == \"get_weather\")\n          and ($c.message.tool_calls[0].function.arguments | fromjson | type == \"object\")\n      ' >/dev/null && echo \"PASS: tool_calls well-formed\" || echo \"FAIL: inspect raw response\"\n    {\n      \"model\": \"nvidia/nemotron-3-super\",\n      \"temperature\": 1.0, \"top_p\": 0.95, \"max_tokens\": 1024, \"tool_choice\": \"auto\",\n      \"messages\": [{\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool to find out.\"}],\n      \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"location\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}}, \"required\": [\"location\"]}}}]\n    }\n    JSON\n\n\n### 4d. Multi-turn round-trip (the one people skip)\n\nA single call passing does **not** guarantee the parser handles the  _tool-result_ turn — where you feed the function's output back and the model continues. Agents do this on every step, so test it. 
Take the `id` from the tool call in 4b and echo it back in a `role: \"tool\"` message:\n\n\n    curl -s https://llm.example.internal/v1/chat/completions \\\n      -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n      -H \"Content-Type: application/json\" \\\n      -d @- <<'JSON' | jq '.choices[0] | {finish_reason, content: .message.content}'\n    {\n      \"model\": \"nvidia/nemotron-3-super\",\n      \"temperature\": 1.0,\n      \"top_p\": 0.95,\n      \"max_tokens\": 1024,\n      \"tools\": [\n        {\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"location\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}}, \"required\": [\"location\"]}}}\n      ],\n      \"messages\": [\n        {\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool.\"},\n        {\"role\": \"assistant\", \"content\": null, \"tool_calls\": [\n          {\"id\": \"chatcmpl-tool-REPLACE_WITH_REAL_ID\", \"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"Zurich\\\"}\"}}\n        ]},\n        {\"role\": \"tool\", \"tool_call_id\": \"chatcmpl-tool-REPLACE_WITH_REAL_ID\", \"content\": \"{\\\"location\\\": \\\"Zurich\\\", \\\"temp_c\\\": 12, \\\"condition\\\": \\\"cloudy\\\"}\"}\n      ]\n    }\n    JSON\n\n\nA healthy result has `finish_reason: \"stop\"` and a natural-language `content` that uses the 12°C / cloudy data you handed back. If it loops — calling `get_weather` again instead of answering — the model isn't correctly consuming the tool result, which will manifest in OpenCode as an agent that repeats actions. 
Note: echo the assistant turn back **without** its `reasoning` field; only `content` and `tool_calls` are required.\n\nOnce 4a–4d pass, point OpenCode at it — it'll use the default model from your config, or run `/models` and select `pulsar/nvidia/nemotron-3-super`.\n\n## OpenCode in Action\n\nOnce everything is set up, using OpenCode is straightforward.\n\nIf you also install OpenCode desktop, the same settings you configured for the OpenCode CLI apply.\n\nWatching the cluster with nvtop shows the model is using both nodes' GPUs while coding.\n\n## Best practices\n\n**Set `limit.context` below `--max-model-len`, not equal to it.** A model that _advertises_ 1M context won't _fit_ 1M tokens of KV cache at a conservative `--gpu-memory-utilization` on memory-constrained hardware. OpenCode uses `limit.context` to decide when to compact the conversation; if you tell it the theoretical max, it will pack prompts the server then rejects mid-session. Set it to a value you've verified fits end-to-end, with margin.\n\n**Give reasoning models a generous output budget.** Reasoning tokens are generated _before_ the tool call and count against `max_tokens`. In testing, a one-argument tool call burned ~160 completion tokens, almost all of it reasoning. Real agentic steps reason far more. A stingy output limit causes `finish_reason: \"length\"` truncation _before_ the tool call is ever emitted — which looks like a parser failure but isn't.\n\n**Pin sampling to the model card's recommendation.** Don't let the agent's defaults override what the model was tuned for. For Nemotron that's `temperature 1.0 / top_p 0.95` across the board.\n\n**Keep your secret in one place.** With `VLLM_API_KEY` enforced server-side and `{env:VLLM_API_KEY}` (or `auth.json`) client-side, that's a single shared secret. 
Rotating it means updating both the server environment and the client — script the rotation so they never drift.\n\n**Pin your runtime version.** Tool-call and reasoning parsers evolve fast across vLLM releases. Record the `system_fingerprint` from a known-good run; if behavior changes after an image bump, that's your first diff.\n\n**Harden the host if you serve large models on shared boxes.** A model that exhausts memory can take SSH down with it (ICMP still replies, `sshd` doesn't — the worst kind of \"is it up?\"). Protect the essentials:\n\n\n    # Keep sshd from being OOM-killed\n    sudo systemctl edit ssh   # add: [Service]\\nOOMScoreAdjust=-1000\n\n    # Userspace OOM killer that acts before the kernel's does\n    sudo apt install earlyoom && sudo systemctl enable --now earlyoom\n\n\nPair that with an external watchdog (a separate machine curling `/health` and power-cycling on N consecutive failures) so a wedged node recovers without a desk visit.\n\n## Gotchas, condensed\n\nSymptom| Cause| Fix\n---|---|---\n`jq: Cannot iterate over null` on `/v1/models`| 401 — missing/wrong `Authorization`; server returned `{\"error\": ...}` with no `data`| Add `-H \"Authorization: Bearer $VLLM_API_KEY\"`\nModel not found / wrong model in OpenCode| Config `models` key ≠ `--served-model-name`| Match exactly; confirm via `/v1/models`\n`/` in model ID rejected| You're on Claude Code, not OpenCode| OpenCode handles slashes; for Claude Code, alias the served name without `/`\n`finish_reason: \"length\"`, no tool call| Reasoning ate the output budget| Raise `max_tokens` (2048–4096)\nTool call described in prose, `tool_calls` null| Tool parser not active or wrong| Verify `--enable-auto-tool-choice` + correct `--tool-call-parser` in startup logs\nReasoning text inside tool arguments| Reasoning parser misconfigured| Use the model's prescribed reasoning parser; confirm `content`/`reasoning` are separate\n`arguments` not parseable JSON| Genuine parser/model mismatch| Re-run; if 
persistent, file upstream\nAgent repeats the same tool call| Tool-_result_ turn not consumed| Run the multi-turn test (4d); check `tool_call_id` echo\nQuant/kernel error at startup| Forced `--quantization` fighting the checkpoint| Drop it; let vLLM auto-detect\nOpenCode `NotFoundError`, empty options| Older OpenCode bug not forwarding provider options| Update OpenCode; ensure the provider `name` field is present\nEndpoint reachable on localhost, not via domain| Reverse proxy not forwarding `/v1/*` or the POST body| Test through the proxy explicitly; fix the `location` block\n\n## Wrap-up\n\nThe hard part of running a coding agent on your own iron isn't the agent — it's proving the _endpoint_ behaves like a real OpenAI-compatible tool-calling server before you trust an autonomous loop to it. OpenCode keeps the agent side trivial: one provider block, native OpenAI, no proxy. Spend your effort on the four-step validation — model list, single tool call, JSON-valid arguments, and the multi-turn round-trip — and the rest is just `opencode`.",
  "title": "Connecting OpenCode to a Self-Hosted LLM (vLLM + Nemotron 3 Super)",
  "updatedAt": "2026-06-19T08:04:59.672Z"
}