Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifia2yjc3uw5df4lulsosodpylhhfzbrgsfhfrzzh7fsgrwrzilvm",
    "uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mn7vehsj2oo2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreif4k3mdjd3qzlxp45myz2dhcztclvtslcj4ys37vajsapwqsjz2fm"
    },
    "mimeType": "image/png",
    "size": 1064130
  },
  "description": "The NVIDIA DGX Spark put a Grace Blackwell superchip on the desk for the price of a high-end workstation. A single unit is already a capable local-inference box — 128 GB of unified memory, FP4 tensor cores, a full NVIDIA software stack. But the feature that quietly changes the platform's ceiling is the one most people skip past at unboxing: the pair of ConnectX-7 200 GbE QSFP ports on the back. Connect two Sparks through them and you stop owning two workstations and start owning a two-node AI cl",
  "path": "/two-sparks-one-cluster-why-stacking-nvidia-dgx-spark-units-unlocks-local-frontier-scale-inference/",
  "publishedAt": "2026-06-01T10:40:41.000Z",
  "site": "https://corti.com",
  "textContent": "The NVIDIA DGX Spark put a Grace Blackwell superchip on the desk for the price of a high-end workstation. A single unit is already a capable local-inference box — 128 GB of unified memory, FP4 tensor cores, a full NVIDIA software stack. But the feature that quietly changes the platform's ceiling is the one most people skip past at unboxing: the pair of **ConnectX-7 200 GbE QSFP ports** on the back. Connect two Sparks through them and you stop owning two workstations and start owning a two-node AI cluster.\n\nThis post walks through what \"Spark Stacking\" actually does at the hardware and software level, and where it earns its keep.\n\n* * *\n\n## The one cable that makes a cluster\n\nThere is no proprietary backplane and no switch involved in a two-node setup. Each DGX Spark carries an onboard NVIDIA ConnectX-7 SmartNIC running at 200 GbE, and you link two units with a single **200G QSFP56 passive Direct Attach Copper (DAC) cable** , 0.5 m long, plugged port-to-port. No transceivers, no SFP adapters — just direct copper between two boxes sitting side by side.\n\nThat simplicity is itself an advantage. The interconnect is a point-to-point **RoCE (RDMA over Converged Ethernet)** link, which gives the two GPUs a high-throughput, low-latency path for the collective operations that distributed inference depends on. NCCL — NVIDIA's collective communication library — runs its all-reduce and all-gather traffic straight over that 200 Gb/s link while MPI handles inter-process coordination on the CPU side.\n\nOne nuance worth understanding, because it shapes expectations: on the GB10 board the ConnectX-7 is wired as two PCIe Gen5 x4 links rather than a single x8. A single x4 link is roughly 100 Gb/s, so the NIC reaches the full 200 Gb/s by aggregating both x4 paths in multi-host mode. 
The practical takeaway is that a single cable on a single port can carry full bandwidth, and the OS will surface four logical interface names for the two physical ports (each port has two names). It's a quirk, not a limitation — but it's the kind of detail that separates a clean bring-up from an afternoon of debugging.\n\n* * *\n\n## Advantage 1: You can run models that simply don't fit on one node\n\nThis is the headline reason to stack. A single Spark's 128 GB of unified memory already lets it hold models that would never fit in a standard GPU's VRAM — a 70B-parameter model in FP8 (at FP16 its weights alone would take about 140 GB), or a ~120B model in FP4, runs on one box. But the moment you want to go bigger, you hit a wall that no amount of quantization on a single node can climb.\n\nLinking two units aggregates the memory to **256 GB**, and that is enough to host frontier-scale models locally. NVIDIA's marquee claim for the two-node configuration is **Llama 3.1 405B in FP4** — a 405-billion-parameter model served across the pair using tensor parallelism. Large mixture-of-experts models in the ~200B–235B class (Qwen3-235B-style architectures, MiniMax-M2.5 at 229B) land in the same category: too large for one node, comfortable across two.\n\nThe important mental model: the two nodes do **not** fuse into a single 256 GB GPU. The model's weights are _partitioned_ across both Sparks — tensor parallelism splits each layer's matrices, pipeline parallelism splits the layer stack — and the nodes exchange activations over the QSFP link every forward pass. What you gain is **capacity**: the ability to load a model whose weights plus KV cache exceed any single node's memory.\n\n* * *\n\n## Advantage 2: Tensor-parallel compute and KV-cache headroom for mid-size models\n\nStacking isn't only for 405B monsters. 
Even a model that fits on one node benefits from being served across two, for reasons that have nothing to do with fitting the weights:\n\n  * **More KV-cache space.** Long-context workloads and high concurrency are bottlenecked by KV-cache memory, not weights. Spreading a 120B model across two nodes frees memory on each for a larger cache, which means longer context windows and more simultaneous sequences before you hit an out-of-memory wall.\n  * **Tensor-parallel throughput.** With `--tensor-parallel-size 2` in vLLM, both Blackwell GPUs share the matrix multiplications for every token. For concurrent, batched serving this raises aggregate tokens/sec meaningfully.\n  * **Continuous batching across the cluster.** vLLM's PagedAttention and continuous batching operate over the distributed setup, so the second node contributes to serving many requests in parallel rather than sitting idle.\n\n\n\nReported figures bear this out: a ~120B-class model (GPT-OSS-120B, MXFP4) that runs around 35–50 tok/s single-stream on one node lands roughly in the 55–75 tok/s range on a stacked pair depending on the engine (vLLM, SGLang, or TensorRT-LLM), with the larger gains showing up under concurrency rather than in a single isolated request.\n\n* * *\n\n## Advantage 3: A documented, repeatable software path\n\nA clustered setup is only an advantage if it's reliable to stand up. NVIDIA publishes the full procedure — physical connection, netplan-based network configuration, passwordless SSH discovery, and a vLLM + Ray cluster launched with tensor parallelism across both nodes. 
The serving layer exposes an **OpenAI-compatible API**, so anything that already talks to OpenAI's endpoint — Open WebUI, a local chat frontend, an agent framework — points at the head node's `:8000/v1` and works unchanged.\n\nThe orchestration is conventional, not exotic: Ray coordinates the cluster and places the vLLM workers, a Ray dashboard gives live GPU and actor visibility, and a set of environment variables pins every collective library (`NCCL_SOCKET_IFNAME`, `UCX_NET_DEVICES`, `GLOO_SOCKET_IFNAME`, `TP_SOCKET_IFNAME`) to the high-speed QSFP interface so traffic never falls back to the slow management NIC. The same Ray-based pattern also underpins TensorRT-LLM and SGLang multi-node deployments, so the skills transfer.\n\n* * *\n\n## Advantage 4: Frontier-scale capability without the cloud\n\nFor teams whose interest in large local models is driven by data residency, privacy, or simply not metering every token through a cloud API, the two-node Spark is a compelling proposition. A pair of compact desktop units — each roughly 150 mm square — gives you a private endpoint capable of 405B-class inference, sitting under a desk, in a lab, or in a location where sending data to a third-party API is off the table. No egress, no per-token billing, no waiting on shared cloud capacity.\n\nIt's also a genuine **develop-to-deploy** path. The DGX Spark runs the same CUDA / NVIDIA AI stack as datacenter Grace Blackwell systems, so a model validated and tuned across two Sparks behaves consistently when promoted to a larger DGX deployment or the cloud. You prototype at frontier scale locally, then scale out without rewriting the stack.\n\n* * *\n\n## The honest caveat: capacity scales, single-stream speed doesn't\n\nA technical post owes you the limitation alongside the upside. The GB10's unified memory is LPDDR5x with a bandwidth around 273 GB/s **per node**, and linking two units does not pool that bandwidth — each node still reads weights at its own rate. 
Token generation on memory-bound autoregressive decoding is governed largely by memory bandwidth, so stacking raises the _ceiling on model size_ far more than it raises _single-token decode speed_. The very largest models (405B) will run, and that's remarkable for a desk-side pair, but they run at modest tokens/sec, and you'll need to constrain context length and KV-cache settings to load them at all.\n\nIn other words: stack two Sparks to run **bigger** models, to serve **more concurrent** requests, and to get **more KV-cache headroom** — not to make a single chat response stream dramatically faster. Frame the purchase around capacity and concurrency, and the two-node Spark is one of the most cost-effective ways to put frontier-scale inference on local hardware.\n\n* * *\n\n## How to set up: stacking two Sparks step by step\n\nTheory aside, here's the full bring-up. The whole process takes well under an hour, and the commands below follow NVIDIA's official _Connect Two Sparks_ procedure and the `dgx-spark-playbooks` vLLM multi-node guide. Conventions used throughout: **Node 1 = head = `192.168.100.10`**, **Node 2 = worker = `192.168.100.11`**, multi-node interface `enp1s0f1np1` (the name that receives the static IP in Step 3). Adapt IPs and the interface name to your own `ibdev2netdev` output.\n\n### Step 0 — What you need\n\n  * **2 × DGX Spark** (or an OEM GB10 variant), both on the same, up-to-date DGX OS image. Update the ConnectX-7 / `mlx5` firmware and the `dgx-spark-mlnx-hotplug` package before you start.\n  * **1 × 200G QSFP56 passive DAC cable, 0.5 m** (part number `Q56-200G-CU0-5`, or a vendor's DGX-Spark-validated equivalent). No switch, no transceivers.\n\n\n\n### Step 1 — Connect the cable\n\nPlug the DAC into **port 1 on Node 1 and the matching port 1 on Node 2** — always connect the _same_ port number on both units, or the link won't come up. 
Then confirm on both nodes:\n\n\n    ibdev2netdev\n\n\nYou want the connected port showing `(Up)` under both of its names:\n\n\n    roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)\n    rocep1s0f1   port 1 ==> enp1s0f1np1   (Up)\n\n\nEach physical port has two names; use the `enp1...` names for configuration and ignore the `enP2p...` duplicates. If nothing shows `(Up)`, reseat the cable, verify matching ports, and reboot both nodes.\n\n### Step 2 — Match the username on both nodes\n\nThe cluster scripts assume an identical login user. Check with `whoami` on each; if they differ, create a common user (e.g. `nvidia`) on both boxes.\n\n### Step 3 — Configure the network (static IPs)\n\nWith a single cable, static netplan addresses give you a stable cluster.\n\n**Node 1:**\n\n\n    sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF\n    network:\n      version: 2\n      ethernets:\n        enp1s0f1np1:\n          addresses: [192.168.100.10/24]\n          dhcp4: no\n    EOF\n    sudo chmod 600 /etc/netplan/40-cx7.yaml\n    sudo netplan apply\n\n\n**Node 2:** identical, but with `192.168.100.11/24`. Then verify connectivity:\n\n\n    ping -c3 192.168.100.11   # from Node 1\n\n\n> If you prefer zero-config, netplan `link-local: [ ipv4 ]` on both nodes auto-assigns `169.254.x.x` addresses — convenient, but the IPs can change on reboot, which complicates a static cluster config.\n\n### Step 4 — Passwordless SSH\n\n\n    ssh-keygen -t ed25519        # if you don't already have a key\n    ssh-copy-id -i ~/.ssh/id_ed25519.pub nvidia@192.168.100.10\n    ssh-copy-id -i ~/.ssh/id_ed25519.pub nvidia@192.168.100.11\n\n\nConfirm with `ssh 192.168.100.11 hostname`. (On some images NVIDIA's `discover-sparks` script automates this discovery and key exchange.)\n\n### Step 5 — Prepare the vLLM containers\n\nOn **both** nodes: install Docker, add your user to the `docker` group, pull a Blackwell-capable NGC vLLM container built for the GB10 GPU (CUDA 13.0+, e.g. 
the `26.02-py3` image or newer), and authenticate to Hugging Face (`huggingface-cli login`) for model downloads.\n\n### Step 6 — Pin every collective library to the QSFP link\n\nThis is the step that most often makes the difference between a cluster that works and one that hangs. On **both** nodes, export the interface that carries the static IP from Step 3:\n\n\n    export MN_IF_NAME=enp1s0f1np1\n    export NCCL_SOCKET_IFNAME=$MN_IF_NAME\n    export GLOO_SOCKET_IFNAME=$MN_IF_NAME\n    export TP_SOCKET_IFNAME=$MN_IF_NAME\n    export UCX_NET_DEVICES=$MN_IF_NAME\n    export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME\n    export RAY_memory_monitor_refresh_ms=0\n    export MASTER_ADDR=192.168.100.10\n\n\nAlso set `VLLM_HOST_IP=192.168.100.10` on the head and `VLLM_HOST_IP=192.168.100.11` on the worker.\n\n### Step 7 — Start the Ray cluster\n\n**Head (Node 1):**\n\n\n    ray start --head --node-ip-address=192.168.100.10 --port=6379 --dashboard-host=0.0.0.0\n\n\n**Worker (Node 2):**\n\n\n    ray start --address=192.168.100.10:6379 --node-ip-address=192.168.100.11\n\n\nVerify from the head node — you should see two nodes and two Blackwell GPUs:\n\n\n    ray status\n\n\n### Step 8 — Serve the model with tensor parallelism\n\nStart with GPT-OSS-120B to validate the cluster end to end:\n\n\n    vllm serve openai/gpt-oss-120b \\\n      --tensor-parallel-size 2 \\\n      --host 0.0.0.0 --port 8000\n\n\nFor the maximum-capability case — Llama 3.1 405B in FP4 — keep memory in check; even 256 GB is tight, so constrain context length and KV cache:\n\n\n    vllm serve <hf-org>/Llama-3.1-405B-Instruct-FP4 \\\n      --tensor-parallel-size 2 \\\n      --max-model-len 4096 \\\n      --gpu-memory-utilization 0.92 \\\n      --kv-cache-dtype fp8 \\\n      --host 0.0.0.0 --port 8000\n\n\n### Step 9 — Test the endpoint\n\nvLLM serves an OpenAI-compatible API on the head node:\n\n\n    curl http://192.168.100.10:8000/v1/chat/completions \\\n      -H \"Content-Type: application/json\" \\\n      -d 
'{\"model\":\"openai/gpt-oss-120b\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello from a two-node Spark cluster.\"}]}'\n\n\nPoint any OpenAI-compatible client at `http://192.168.100.10:8000/v1`, and watch the **Ray dashboard** at `http://192.168.100.10:8265` for live GPU utilization and worker placement across both Sparks.\n\n### Quick troubleshooting\n\n  * **No `(Up)` interface / QSFP cage won't power** (`insufficient power on PCIe slot (27W)`): the known hotplug issue — toggle `dgx-spark-mlnx-hotplug`, update firmware, and reboot both nodes.\n  * **NCCL timeout or hang at model load:** `NCCL_SOCKET_IFNAME` isn't set to the QSFP interface on _both_ nodes.\n  * **`Connection refused` on Ray join:** the worker can't reach `192.168.100.10:6379` over the QSFP link — recheck IPs and routing.\n  * **Out-of-memory at load:** flush the cache with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`, then lower `--max-model-len` and `--gpu-memory-utilization`.\n\n\n\n* * *\n\n## When stacking is the right call\n\nLink two DGX Spark units if any of these describe you:\n\n  * You need to run a model that exceeds 128 GB — 405B in FP4, or a large MoE in the 200B+ class — entirely on local hardware.\n  * You're serving a 70B–120B model to multiple users and want more concurrency and longer contexts than one node's KV cache allows.\n  * You want a private, frontier-capable inference endpoint with no cloud egress and predictable cost.\n  * You're building a develop-to-deploy pipeline and want local behavior to match datacenter Grace Blackwell systems.\n\n\n\nIf your workload comfortably fits one node and you only care about fastest single-stream latency, a single Spark — or a higher-bandwidth GPU — may serve you better. But for anyone whose constraint is _model size_ or _concurrency_ rather than raw per-token speed, the second Spark and a 0.5 m copper cable are the cheapest path to a meaningfully larger local AI ceiling.",
  "title": "Two Sparks, One Cluster: Why Stacking NVIDIA DGX Spark Units Unlocks Local Frontier-Scale Inference",
  "updatedAt": "2026-06-01T10:40:42.209Z"
}