Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic7djicsfl4kxhrpuokkfvdeir7sy6dirmddo7ik2bxq34xxxg4fm",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp22lfv6yww2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiainlkz5mut2w6tg4tdqj6xgwhu2vz6gisd34jthh5dgaobda542i"
    },
    "mimeType": "image/webp",
    "size": 62052
  },
  "path": "/constant_chen_/sipp-a-local-first-runtime-for-hybrid-ai-applications-2ce5",
  "publishedAt": "2026-06-24T13:37:58.000Z",
  "site": "https://dev.to",
  "tags": [
    "inference",
    "ai",
    "localai",
    "llm",
    "Llamas on the Web",
    "paper",
    "https://www.sipp.sh/",
    "ONNX",
    "TVM/WebLLM",
    "Apache TVM",
    "benchmark.sipp.sh/benchmark"
  ],
  "textContent": "Over the past few months, I had the opportunity to contribute to llama.cpp’s WebGPU backend, helping push it from isolated operator support toward a more complete and reliable path for browser-based and multimodal inference. It was a collective effort with dozens of contributors, and was an essential component of getting Sipp ready for release. The lead maintainer, Reese Levine, wrote a really nice blog post about it, Llamas on the Web, and published a paper around the architecture design. In this tech post, I want to share some thoughts into the WebGPU in-browser inference, and the larger ecosystem we are building to unify both local and cloud compute in a hybrid inference design for making intelligence more available.\n\nCheck out Sipp: https://www.sipp.sh/\n\n#  What it means to make intelligence available\n\nA technology becomes interesting when it works in a lab, and becomes transformative when people can actually reach it, run it, adapt it, and build with it in the environments they already use. That is the sense in which Sipp is about making AI more available.\n\nLanguage models have mostly entered software through chatboxes, but interactive applications rarely begin with a complete prompt. In games, design tools, and agent workspaces, the useful context is often the live environment itself: the open document, selected objects, recent edits, user actions, scene state, and background work. As open models become smaller and more capable, more of that work can happen close to the user. This matters for more than speed and privacy. Useful contexts should not require every interaction to cross a cloud boundary, depend on frontier-model pricing, or be gated by a single provider. Frontier models are\nstill better for deep reasoning, and planning, but most interactive software does not need that depth at every step. 
The stronger design is a split one: local models for immediate, private, continuous interaction, and larger remote models when the task truly requires deeper reasoning.\n\nSipp is designed for the space between them. You can register a local model, a gateway target, or a provider endpoint, and then call the same operations against the selected endpoint. In the remaining sections, I will introduce the system design behind Sipp, including:\n\n  * how Sipp represents local, gateway, and provider endpoints\n  * why `query`, `chat`, and `embed` are separate operations\n  * how the local engine schedules requests and reuses key-value (KV) cache state\n  * how the browser host makes WebGPU practical for GGUF models\n  * how the gateway provides a policy and operations boundary for remote compute\n\n\n\n##  Architecture at a glance\n\nSipp uses endpoint registration. An endpoint can run in the same process, in the browser, behind an HTTP gateway, or at an external provider. The application keeps the endpoint reference and passes it to `query`, `chat`, or `embed`.\n\n\n\n    flowchart LR\n      App[Application] --> Client[SippClient]\n      Client --> Local[Local endpoint]\n      Client --> Gateway[Gateway endpoint]\n      Client --> Provider[Provider endpoint]\n\n      Local --> Engine[Sipp Engine]\n      Engine --> Native[llama.cpp and ggml]\n\n      Gateway --> GatewayServer[Gateway server]\n      GatewayServer --> ServerClient[SippClient]\n      ServerClient --> ServerLocal[Server local model]\n      ServerClient --> ServerProvider[Provider target]\n\n\nThe diagram shows two important properties. First, the application chooses the endpoint. Sipp does not silently move a request from a local model to a remote model. The application owns the product\nconstraints, such as privacy, cost, latency, and quality. Second, the operation shape stays stable. 
A feature can start with a browser model, add a server-side local model later, and then add a gateway target for harder requests without changing the public operation model.\n\n##  Endpoint model\n\nThe core client stores endpoints behind a common inference endpoint interface. `SippClient::add()` builds the endpoint from a descriptor:\n\n  * A local model descriptor loads `SippEngine`.\n  * A gateway descriptor builds an HTTP gateway endpoint.\n  * A provider descriptor builds a direct provider adapter when provider support is enabled.\n\n\n\nAfter registration, Sipp resolves operations to the selected endpoint, checks whether it supports the requested operation and returns the same run and response types. This design keeps endpoint selection explicit while keeping the application API small.\n\n##  Request operations\n\nSipp separates `query`, `chat`, and `embed` because each operation has different\nruntime requirements.\n\n`query` sends a raw prompt string, without applying a chat template. Use this\noperation when the application owns the prompt format, such as completion-style\nprompts, few-shot prompts, custom templates, encoder-decoder flows, or agent\nloops that render prompts themselves.\n\n`chat` sends ordered role and content messages. A local endpoint applies the\nmodel's chat template. A gateway or provider endpoint maps those messages into\nthe selected remote protocol.\n\n`embed` returns vectors instead of generated text. The endpoint must support\nembeddings, and the local runtime uses a different path because it reads an\nembedding result instead of sampling output tokens.\n\nThe separation prevents common integration bugs. A raw prompt does not\naccidentally become a chat transcript. A chat request is not sent to a local\nmodel without the model's template. An embedding request is not routed to a\ngeneration-only endpoint.\n\nThe gateway and provider paths also enforce this boundary. 
Gateway requests\nreject local-only fields such as `contextKey`, `grammar`, and local sampling overrides. Endpoint-specific options must be JSON-compatible and cannot override typed fields such as `model`, `prompt`, `messages`, or `stream`.\n\n##  Local engine\n\nThe local engine is the part of Sipp that turns endpoint calls into interactive\ninference work.\n\nThe core loop is tick-based. A request enters the runtime, becomes an internal\ngeneration or embedding request, and advances through scheduler ticks. This\nmodel fits interactive applications, where a visible chat response, a short\nbackground classification, and a longer prefill can overlap.\n\nAt each tick, the engine separates prompt prefill from token decode. The batch\nplanner builds a flat list of token contributions from ready slots. Each\ncontribution is either `Prefill` or `Decode`, and includes the slot index,\nrequest ID, token, position, and whether the request needs logits.\n\nThis split gives the scheduler direct control over latency and throughput:\n\n  * Decode steps are latency-sensitive because they produce visible tokens.\n  * Prefill steps can process many prompt tokens before the first generated token.\n  * Active decode streams should not wait behind a long prompt.\n  * Long prompts should still make progress while decode streams are active.\n\n\n\nThe planner budgets decode and prefill separately. Reusing the plan\navoids repeated allocation in the scheduler hot path. The planner also uses a\nsmall bitmask fast path for counting occupied slots instead of allocating a\n`HashSet` on each tick.\n\nAfter the native backend runs, Sipp applies request bookkeeping and emits token\nbatches. The runtime records the request metrics for observability.\n\n##  Key-value cache as runtime state\n\nEach local request can carry a `contextKey`. The key represents the logical\nworkflow, such as a document, scene, conversation, workspace, or\nbackground task. 
The engine uses that key to decide whether it can reuse live KV\nstate or restore a prefix snapshot.\n\nThe `KvCacheManager` maps context keys to physical sequence slots. When a request completes and cache reuse is enabled, Sipp can keep the sequence idle but resident. A later request with the same `contextKey` can use that warm state. If the runtime has more active contexts than physical sequences, it evicts idle sessions with an LRU policy. Sipp also supports prefix snapshots. During prefill, the runtime can restore the best matching snapshot for the same model fingerprint and context scope. It then recomputes only the missing suffix. The prefill path computes longest common prefix reuse, checks whether partial KV reuse is valid for the model family, makes room in the context window, and records cache hits.\n\nThis cache state matters for hybrid routing because it changes the cost of\nasking for local evidence. A warm `contextKey` can let the runtime reuse prefix\ntokens, run a short verifier, or draft an answer without paying full prefill\ncost again. A router can then decide whether to return the local result, send\nthe draft to the gateway for audit, or skip local work when the task likely needs a stronger model. For example, an editor can use the same `contextKey` while a user works inside\none file. The first request pays the cost of reading the file and recent edits into the local model. A follow-up request, such as \"is this edit safe?\", can reuse that warm prefix and run a cheap local check. If the check is confident, the UI can respond immediately. If the check is uncertain or needs more context, the router can send the task to the gateway instead.\n\nInstead of a static local-then-cloud cascade, a router can use cache hits, prefill cost, network latency, and provider cost as routing inputs. 
We are actively researching this area and hope to share more insights in the future.\n\n##  Browser host\n\nThe browser host owns packaging, model staging, capability selection, and the JavaScript-facing runtime API. It does not duplicate the inference engine in the Rust core.\n\nThe build compiles the Rust browser ABI as an Emscripten static library. It then\nlinks that library with `llama.cpp`, `ggml`, `ggml-webgpu`, and the multimodal\nruntime. The `ggml-webgpu` target embeds WGSL shader files into a generated\nheader, and the Emscripten build uses Dawn's `emdawnwebgpu` port to call browser\nWebGPU from C++ and WebAssembly.\n\nThe browser client runs through a worker-backed model service or a\nmain-thread model service and ships single-thread and pthread WebAssembly\nartifacts. The pthread artifact requires `SharedArrayBuffer`, cross-origin\nisolation, and the deployment headers that enable shared memory. The\nsingle-thread artifact remains available when those requirements are not met.\n\nBackend selection is capability-aware:\n\n  * If the app requests CPU, Sipp uses CPU.\n  * If the app requests WebGPU, Sipp returns adapter information.\n  * If the app uses automatic selection, Sipp selects WebGPU only when the adapter exposes `shader-f16`; otherwise, it falls back to CPU.\n\n\n\nModel loading is part of inference performance. The browser cache policy loads\nfiles directly up to 2 GiB. It splits larger GGUF assets into 512 MiB shards and\nloads those shards automatically. For large assets, the browser package stores\nthe files in OPFS, opens sync access handles, and mounts those handles into\nEmscripten's filesystem. Read calls copy bytes from OPFS directly into a\n`Uint8Array` view of the WebAssembly heap. That avoids reading into a JavaScript\n`ArrayBuffer` and then copying the same bytes again into `HEAPU8`. Vision models use the same lifecycle. 
The main model weights can be staged as GGUF shards, while the projector artifact is staged separately for the multimodal runtime.\n\n##  WebGPU case study\n\nThe main difference between the GGML WebGPU backend and ONNX or TVM/WebLLM is the model representation.\nONNX treats WebGPU as an execution provider for ONNX graphs. That is\na good fit for portable graph artifacts and provider abstraction. The tradeoff\nis that GGUF-native details, such as tokenizer metadata, chat templates, KV\nbehavior, and `llama.cpp` quantized layouts, must cross a different artifact\nboundary. TVM/WebLLM uses a compiler pipeline. Model computation is lowered through\nMLC-LLM and Apache TVM into WebGPU and WebAssembly\nartifacts. That path can apply ahead-of-time optimization, graph fusion, and\noperator scheduling for a curated model catalog. The tradeoff is that users do\nnot point the runtime at an arbitrary GGUF file and run it directly.\n\nGGML WebGPU keeps the model format and runtime behavior closer to execution.\n`ggml` still builds the tensor graph dynamically, and WebGPU executes that graph\nin the browser. The backend maps tensor views to WebGPU buffer offsets, supports\nquantized layouts without expanding them into large intermediate tensors,\nspecializes shader pipelines by tensor type and quantization format, and reuses\nper-kernel parameter storage. For quantized matrix multiplication,\ndequantization stays in the shader path instead of becoming a separate\nmodel-conversion step.\n\nThat architecture fits decode-heavy workloads. Decode often depends on memory\nbandwidth because each generated token streams weights and attends over KV\nstate. Keeping quantized weights close to execution reduces memory expansion and\ncopy costs. 
Prefill has different tradeoffs because it exposes more parallelism\nand can benefit from compiler fusion.\n\nThrough our public browser benchmark tooling at\nbenchmark.sipp.sh/benchmark, we achieved\nthe following results with an NVIDIA RTX 3080, one warmup run, and three\nmeasured runs:\n\nRuntime or framework | Time to first token, lower is better | Decode, higher is better | End-to-end latency, lower is better\n---|---|---|---\nSipp browser runtime | 24.3 ms | 77.07 tok/s | 6,655 ms\nWebLLM | 160.0 ms | 25.80 tok/s | 19,930 ms\nTransformers.js | 301.0 ms | 33.25 tok/s | 15,670 ms\n\nThis benchmark is one data point. Browser version, model, memory pressure, and other factors can change results, but these numbers are still useful because they match the architecture: fewer avoidable copies, GGUF-native execution, and quantized decode in the backend.\n\n##  Gateway control plane\n\nSipp separates gateway responsibilities into layers:\n\n  * `sipp::gateway_core` defines protocol-neutral operations, request context, cancellation, stream events, target resolution, authorization, admission, and execution traits.\n  * `lib/gateway` provides route-free HTTP helpers, including codecs, authentication traits, error translation, JSON responses, and server-sent events (SSE) encoding.\n  * `apps/gateway-server` is the Axum application. It owns TOML configuration, listeners, bearer tokens, CORS, request-size limits, concurrency limits, rate limiting, metrics, and the admin dashboard.\n\n\n\nThis separation keeps policy out of the local engine. A product can hide\nprovider secrets, restrict targets by caller, enforce request-size limits, rate\nlimit public clients, expose health routes, or run a local model on a server GPU\nwithout changing the local runtime.\n\nThe gateway server loads configured targets into a `SippClient` and\nstores a map from public target names to endpoint references. 
Incoming requests\nresolve a target, check authorization, acquire admission, and then execute the\nsame `query`, `chat`, or `embed` operation through `SippClient`.\n\nQuery and chat can return finite responses or streams. Streaming responses use\nSSE events: token batches, optional usage, and a final done event with finish\nmetadata. Embeddings use a finite response path.\n\nThe browser gateway client mirrors that contract. It validates the gateway base\nURL, allows HTTP only for loopback, supports bearer and header authentication\nwith value providers, redacts secrets from errors, bounds error bodies and SSE\nevent sizes, and parses token, usage, done, and error events into the same\nbrowser run abstraction used by local inference.\n\n##  Hybrid routing\n\nHybrid routing is under active research, and the decision depends on runtime\nsignals, including:\n\n  * local model latency for the current device and backend\n  * privacy requirements for the task\n  * expected reasoning depth\n  * remote target availability, cost, and authorization\n\n\n\nWith those signals, an application can start with simple policies:\n\n  * Keep UI classification, lightweight summarization, grammar-constrained extraction, scene synchronization, and private state-adjacent tasks local.\n  * Delegate long-horizon planning, difficult synthesis, stronger world knowledge, or high-accuracy tasks to a gateway target.\n  * Send the remote model only the context it needs.\n\n\n\nThe application still owns the routing decision. Sipp provides the common\nruntime model that makes the decision practical.\n\n##  Closing\n\nAt the start of Sipp, the motivation was simple: AI should be usable in more places than a remote chatbox. By the end, that had become a systems question: how do we make model compute available inside real applications, close to the user when interaction demands it, and connected to deeper reasoning when the task requires it? 
That is the vision Sipp is built to support.\n\nDuring this journey, I learned a lot by working close to inference itself and to the `llama.cpp` community. My work covered backend stability, shader correctness and optimization, quantized-kernel behavior, and the operator coverage needed by vision models. A challenge along the way was tracing precision drift through the CLIP vision path, attention shaders, and feed-forward layers, then fixing the shader behavior until multimodal outputs matched the CPU reference much more closely. This was not just implementation work; it required low-level debugging across WGSL shaders, memory semantics, quantization formats, and real model validation. These contributions became a central part of making llama.cpp’s WebGPU support more stable, complete, and practically usable. The deeper lessons, however, were not only technical. They came from seeing how much careful work sits behind a model that feels simple to use, and from collaborating with people who were solving the same problems from different parts of the stack.\n\nLooking forward, local models handle the work that benefits from being close to the user. Cloud models are available when a task needs more depth. Applications should not have to choose one side forever. They should be able to place computation where it makes the experience better. This brings the argument back to where it started: the future is not just better chatboxes, but environment-aware software that can act close to the user and reason deeply when needed. Sipp is built to make that practical.",
  "title": "Sipp: a local-first runtime for Hybrid AI Applications"
}