Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigyzhb7eyuhnh6fokhnwfoty3wgmvrxfdyxw63ror5gxif3fccixu",
    "uri": "at://did:plc:5opbpi2nomj4y3d5kpwamkrd/app.bsky.feed.post/3mnypsgf6jvs2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreif2n6wh73inbpupwjcf5vdtpfqdunbc65lhuytrrx64vrihjfayiq"
    },
    "mimeType": "image/jpeg",
    "size": 111281
  },
  "description": "Every local-inference setup eventually hits the same wall: a model you want to run is a few gigabytes too big for the one machine you'd run it on. You have a 128 GB Mac Studio. The model wants 160 GB. You also happen to have a 128 GB DGX Spark sitting on the same network. The obvious question is whether you can staple the two together and run the thing.\n\nYou can. This post is about exactly that configuration - and about being honest, up front, about what you get and what you give up. The short v",
  "path": "/borrowing-memory-not-speed-clustering-a-mac-studio-and-a-dgx-spark-with-exo/",
  "publishedAt": "2026-06-11T07:37:58.000Z",
  "site": "https://corti.com",
  "tags": [
    "exo"
  ],
  "textContent": "Every local-inference setup eventually hits the same wall: a model you want to run is a few gigabytes too big for the one machine you'd run it on. You have a 128 GB Mac Studio. The model wants 160 GB. You also happen to have a 128 GB DGX Spark sitting on the same network. The obvious question is whether you can staple the two together and run the thing.\n\nYou can. This post is about exactly that configuration - and about being honest, up front, about what you get and what you give up. The short version: exo lets you pool the memory of both boxes into a single inference cluster, which makes the otherwise-unrunnable model runnable. It does **not** make it fast, and on this particular hardware pairing the reasons why are worth understanding before you spend an evening on it.\n\nThis is \"Option B\": use exo to borrow the Spark's memory capacity so a model that overflows the Mac can run at all. It is not the configuration you reach for when you want throughput. That distinction is the whole point.\n\n## What exo is, and the one caveat that shapes everything\n\nexo (from EXO Labs, Apache 2.0) is an open-source distributed inference framework. You run it on each device on your network; the devices discover each other automatically, exo profiles each one's compute, memory, and link bandwidth, and it shards a model across them so you can run models larger than any single device could hold. It exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs at `http://localhost:52415`, so existing clients work unchanged.\n\nHere is the caveat that governs this entire build:\n\n> **exo uses the GPU on macOS via MLX. On Linux, exo currently runs on CPU. GPU support for Linux is under development.**\n\nThe DGX Spark runs DGX OS (Ubuntu 24.04). That means under the current public release, the Spark's GB10 Blackwell GPU is **not used by exo at all**. 
The Spark joins the cluster as a Grace-CPU node that contributes its 128 GB of memory and its CPU cores - nothing more. The widely-shared EXO Labs demo that paired a DGX Spark with a Mac Studio for a ~2.8× speedup relied on the Spark doing _GPU_ prefill; that path is not reproducible on the stock Linux build. If you go in expecting Blackwell acceleration from the Spark, you will be disappointed. Go in expecting a memory donor and you'll be calibrated correctly.\n\n## The topology\n\n\n            ┌────────────────────────────┐\n            │        Mac Studio          │\n            │   MLX GPU  ·  128 GB       │    ← only GPU-accelerated node\n            └────────────┬───────────────┘\n                         │ 1 GbE              ← the bottleneck\n                         │\n            ┌────────────┴───────────────┐\n            │        DGX Spark           │\n            │  CPU-only in exo · 128 GB  │   ← memory donor; GB10 GPU idle\n            └────────────────────────────┘\n\n\nTwo facts about this picture do most of the work:\n\n  1. **Only the Mac uses a GPU.** The Spark contributes CPU + RAM.\n  2. **The link between them is 1 GbE** - roughly 125 MB/s, about two orders of magnitude slower than the RDMA-over-Thunderbolt-5 interconnect exo's headline benchmarks used. exo's planner is topology-aware and will treat this link as the slow, high-latency edge it is.\n\n\n\nIf you have a second Spark joined to the first by a 200 GbE fabric, note that it does **not** help here: that link only connects Spark-to-Spark, and under exo both ends are CPU. A 200 GbE cable between two CPU inference nodes solves a problem you don't have. It's the right fabric for vLLM + Ray (which _does_ drive the GB10 GPUs), not for an exo memory-borrow.\n\n## When Option B is the right call\n\nA simple decision rule:\n\n  * **Model fits in 128 GB →** run on the Mac alone (exo single-node, or LM Studio). Adding the Spark over 1 GbE will only slow you down. 
Don't cluster.\n  * **Model needs 128–256 GB →** this is the _only_ case where adding the Spark via exo earns its keep. You're trading a large speed penalty for the ability to run the model at all.\n  * **You want fast inference across GPUs →** wrong tool. Use vLLM + Ray on the Spark(s) over the fast fabric, and keep the Mac separate.\n\n\n\nOption B is a capacity play, full stop.\n\n## Setting it up\n\nBoth nodes must be on the same network; discovery is automatic. Install exo on the Mac (the GPU node) and on the Spark (the memory donor).\n\n### On the Mac Studio\n\nThe simplest route is the prebuilt app: download `EXO-latest.dmg` from `https://assets.exolabs.net/EXO-latest.dmg` (requires macOS Tahoe 26.2 or later). It runs in the background and will ask to install a network profile.\n\nAlternatively, build from source if you prefer to control the build:\n\n\n    # Prerequisites: Xcode (Metal toolchain for MLX), Homebrew\n    brew install uv node\n    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n    rustup toolchain install nightly\n\n    # macmon: install the pinned fork - Homebrew's macmon 0.6.1 crashes on M5-class chips\n    cargo install --git https://github.com/vladkens/macmon \\\n      --rev a1cd06b6cc0d5e61db24fd8832e74cd992097a7d macmon --force\n\n    git clone https://github.com/exo-explore/exo\n    cd exo/dashboard && npm install && npm run build && cd ..\n    uv run exo\n\n\n### On the DGX Spark (DGX OS / Ubuntu 24.04)\n\n\n    sudo apt update && sudo apt install -y nodejs npm\n    curl -LsSf https://astral.sh/uv/install.sh | sh\n    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n    rustup toolchain install nightly\n\n    git clone https://github.com/exo-explore/exo\n    cd exo/dashboard && npm install && npm run build && cd ..\n    uv run exo\n\n\n`macmon` is macOS-only; skip it on the Spark.\n\n### Isolate the cluster\n\nIf the boxes live on a shared network, give the cluster its own namespace so it can't accidentally merge 
with another exo instance:\n\n\n    EXO_LIBP2P_NAMESPACE=pulsar-exo uv run exo\n\n\nSet the same namespace on both nodes.\n\n### Point model storage somewhere with room\n\nLarge models need a writable cache with space. On Linux, exo defaults to `~/.local/share/exo/models`; you can redirect or add read-only shared stores:\n\n\n    # Additional writable dir (first one with enough free space wins)\n    EXO_MODELS_DIRS=/mnt/fast-nvme/exo-models uv run exo\n\n    # Read-only pre-downloaded models (e.g. an NFS mount you've already populated)\n    EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs/models uv run exo\n\n\n### The critical step: override auto-placement\n\nThis is where Option B is won or lost. exo's default partitioning strategy is **ring memory-weighted**: it assigns layers to each device in proportion to that device's memory. With 128 GB on the Mac and 128 GB on the Spark, that default lands roughly **50/50** - which means about half your model's layers run on the slow CPU Spark. That is the worst possible split for throughput. You want the _minimum_ number of layers on the Spark that still lets the model fit.\n\nSo don't accept the default. Preview the valid placements, inspect how much memory each one puts on each node, and force a pipeline split that keeps as much as possible on the Mac:\n\n\n    # 1. Preview placements; filter out errors and look at the per-node memory deltas\n    curl \"http://localhost:52415/instance/previews?model_id=YOUR_MODEL\" \\\n      | jq '.previews[] | select(.error==null)\n            | {sharding, instance_meta, memory_delta_by_node}'\n\n\nChoose a placement where:\n\n  * `sharding` is `Pipeline`, not `Tensor` (more on why below), and\n  * `memory_delta_by_node` puts the largest share on the Mac (`local`) and only the overflow on the Spark.\n\n\n\nThen create that exact instance:\n\n\n    # 2. 
POST the chosen placement object to /instance\n    curl -X POST http://localhost:52415/instance \\\n      -H 'Content-Type: application/json' \\\n      -d '{ \"instance\": { ...the placement you picked... } }'\n\n    # 3. Run a completion\n    curl -N -X POST http://localhost:52415/v1/chat/completions \\\n      -H 'Content-Type: application/json' \\\n      -d '{ \"model\": \"YOUR_MODEL\",\n            \"messages\": [{\"role\":\"user\",\"content\":\"Hello\"}],\n            \"stream\": true }'\n\n\n## How it will perform\n\nSet expectations with the pipeline-parallel execution model rather than with hope.\n\n**Pipeline (ring) vs. tensor parallelism.** Tensor parallelism splits every layer's tensors across devices and does an all-reduce at **every layer** - it is extremely sensitive to inter-node bandwidth and latency. Over a 200 GbE or Thunderbolt link it's fine; over 1 GbE it is pathological. Pipeline parallelism instead gives each device a contiguous block of layers, so data crosses the link only at the cut point(s). On a 1 GbE fabric, pipeline is the only sane choice. This is why the setup above forces `Pipeline`.\n\n**Where the time actually goes.** In a two-stage Mac→Spark pipeline:\n\n  * _Decode (token generation)_ sends a single token's hidden state across the cut - on the order of tens of KB at the cut point. The 1 GbE link transfers that almost instantly; bandwidth is **not** the decode bottleneck. The bottleneck is the **Spark's CPU computing its share of the layers** for every token. Large-model CPU inference is memory-bandwidth-bound and slow, and a pipeline runs only as fast as its slowest stage. Your tokens-per-second will be gated by that CPU stage, plus a small per-token network round-trip.\n  * _Prefill (prompt processing)_ is worse for the link. The activation crossing the cut for a prompt of length _L_ is an `[L, hidden]` tensor. 
For `L = 4096` and a hidden size around 8192 in fp16 (2 bytes per value), that's roughly 4096 × 8192 × 2 bytes ≈ **64 MB per cut crossing** - about half a second on 1 GbE just to move it once, on top of the Spark CPU grinding through its layers over the whole prompt. Long prompts amplify both costs.\n\n\n\nThe net result is predictable: **substantially slower than the Mac running alone**, justified only because the alternative is the model not running at all. There is no free lunch where the Spark's memory comes without the Spark's CPU speed attached.\n\n**Measure, don't guess.** exo ships `exo-bench`, which reports prompt tokens/sec, generation tokens/sec, and peak memory per placement. Run it for both the Mac-only and Mac+Spark placements so you have real numbers for _your_ model:\n\n\n    uv run bench/exo_bench.py \\\n      --model YOUR_MODEL \\\n      --pp 128,512,2048 \\\n      --tg 128 \\\n      --max-nodes 2 \\\n      --sharding pipeline \\\n      --repeat 3 \\\n      --json-out exo-results.json\n\n\nIf the model fits in 128 GB and you run this comparison anyway, the data will almost always tell you to drop the Spark and stay single-node. 
That's the expected and correct outcome — it confirms Option B is for overflow only.\n\n## Advantages\n\n  * **It runs models that don't fit on any single box you own.** This is the entire reason to do it, and it delivers.\n  * **Fully local and private.** No data leaves your network — relevant if you're running this inside a corporate environment with data-handling constraints.\n  * **Cheap capacity.** You're using hardware you already have rather than buying a single machine with more unified memory.\n  * **Drop-in APIs.** OpenAI / Claude / Ollama compatibility means OpenWebUI, existing scripts, and agent frameworks point at `localhost:52415` and just work.\n  * **Zero-config discovery.** No manual IP wiring; nodes find each other on the LAN.\n\n\n\n## Disadvantages\n\n  * **The Spark's GPU is wasted.** Under exo on Linux you're paying for a Blackwell GPU and using a Grace CPU. This is the single biggest inefficiency of the configuration.\n  * **1 GbE is a hard ceiling on prefill.** Long-context prompts pay a real transfer tax at every cut crossing.\n  * **Throughput is gated by the slowest stage.** Pipeline parallelism means the CPU Spark sets the pace; the fast Mac spends time idle waiting.\n  * **It's alpha-grade software.** exo is moving fast and is explicitly experimental in places; expect rough edges and breaking changes between releases.\n\n\n\n## Pitfalls\n\nA concrete checklist of things that will bite you:\n\n  1. **Accepting the default memory-weighted placement.** With 128/128 it splits ~50/50 and buries half your layers on the CPU node. Always override toward Mac-heavy. This is the number-one mistake.\n  2. **Letting it pick tensor parallelism.** Over 1 GbE, tensor parallel's per-layer all-reduce will collapse throughput. Force `Pipeline`.\n  3. **Expecting CUDA acceleration from the Spark.** It won't happen on the stock Linux build. The GB10 sits idle.\n  4. 
**Trying to use the 200 GbE Spark↔Spark fabric for this.** It connects two CPU nodes under exo and buys you nothing here. Save it for vLLM + Ray.\n  5. **Running out of model-cache disk.** Big models need a big, fast writable cache. Set `EXO_MODELS_DIRS` to NVMe with headroom before you start a 150 GB download.\n  6. **Cluster cross-talk on a shared network.** Without `EXO_LIBP2P_NAMESPACE`, your cluster can merge with someone else's exo instance. Namespace it.\n  7. **Benchmarking once and trusting it.** Use `--repeat` and a `--warmup`; cold-cache and first-run numbers are not representative.\n  8. **Forgetting this is overflow-only.** If you find yourself clustering a model that fits in 128 GB \"because the Spark is there,\" stop - you've made it slower for no reason.\n\n\n\n## Verdict\n\nOption B does precisely one thing well: it lets a model that's too big for your Mac Studio run by borrowing the Spark's memory. Treat it as a capacity extension, force a Mac-heavy pipeline split, keep your prompts short where you can, and measure before you commit it to anything you depend on.\n\nThe moment your actual goal becomes _throughput_ rather than _fit_, the answer changes entirely: put the Spark(s) on vLLM + Ray over the fast fabric so the Blackwell GPUs do real work, and run the Mac as its own MLX node for low-latency interactive use. exo and vLLM/Ray are answering different questions. Option B is the right answer to \"how do I run this oversized model locally at all\" - and the wrong answer to almost everything else.",
  "title": "Borrowing Memory, Not Speed: Clustering a Mac Studio and a DGX Spark with exo",
  "updatedAt": "2026-06-11T07:41:14.926Z"
}