{
  "$type": "site.standard.document",
  "path": "/devlog/008",
  "publishedAt": "2026-05-25T15:56:58Z",
  "site": "at://did:plc:mkqt76xvfgxuemlwlx6ruc3w/site.standard.publication/3khuwc44c2256",
  "textContent": "# the io migration\n\nzig 0.16 replaced the networking and concurrency primitives. `std.net`, `std.Thread.Pool`, `std.Thread.Mutex` — all gone, replaced by `std.Io`. this is the story of migrating zat and zlay to the new system, the nine crashes that followed, and what we learned about the gap between a bug and what you think the bug is.\n\n## what changed in 0.16\n\n[`std.Io`](https://ziglang.org/documentation/master/std/#std.Io) is a backend-agnostic interface for all I/O and concurrency. two backends:\n\n- [**Threaded**](https://ziglang.org/documentation/master/std/#std.Io.Threaded) — always available. `io.concurrent()` spawns OS threads.\n- [**Evented**](https://ziglang.org/documentation/master/std/#std.Io.Evented) — fiber-based. [io_uring on linux, GCD on macOS, kqueue on BSD](https://ziglang.org/devlog/2026/#2026-02-13). `io.concurrent()` creates cheap userspace coroutines.\n\nsame code runs on both. you write against `Io`, pick the backend at init, and the scheduler does the rest. [`io.async()`](https://ziglang.org/documentation/master/std/#std.Io.async) for CPU work (may run inline if no concurrency available). [`io.concurrent()`](https://ziglang.org/documentation/master/std/#std.Io.concurrent) for I/O tasks (OS threads under Threaded, fibers under Evented). `io.sleep()` is cancellation-aware. [`Io.Mutex`](https://ziglang.org/documentation/master/std/#std.Io.Mutex) integrates with the scheduler's futex.\n\nthe promise: write once, switch backends, get threads or fibers for free.\n\nthe catch: the scheduler integration means `Io.Mutex`, `Io.Condition`, and `io.sleep()` are not just synchronization primitives — they're scheduler entry points. call them from the wrong execution context and the scheduler dereferences state that doesn't exist.\n\n## the library migration\n\nzat's migration was straightforward. 
every networking type gained `io: std.Io` as its first init parameter:\n\n```zig\n// 0.15\nvar resolver = zat.HandleResolver.init(allocator);\nvar client = zat.XrpcClient.init(allocator, \"https://bsky.social\");\n\n// 0.16\nvar resolver = zat.HandleResolver.init(io, allocator);\nvar client = zat.XrpcClient.init(io, allocator, \"https://bsky.social\");\n```\n\nthe streaming clients got a bigger change. `connect()` + `next()` loops became `subscribe(handler)`:\n\n```zig\n// 0.15\nvar client = zat.JetstreamClient.init(allocator, .{...});\ntry client.connect();\nwhile (try client.next()) |event| { ... }\n\n// 0.16\nvar client = zat.JetstreamClient.init(io, allocator, .{...});\ntry client.subscribe(&handler);\n```\n\n`subscribe` blocks forever — reconnects with exponential backoff, rotates hosts, calls `handler.onEvent()` for each frame. the handler is `anytype` with optional `onError` and `onConnect` callbacks. cancellation propagates through `Io.Cancelable`.\n\ninternally: `std.crypto.random` → `io.random()`. `std.posix.nanosleep` → `io.sleep()`. `libc.gettimeofday` → `Io.Timestamp`. websocket.zig took `io` in its client and server init.\n\nzat and websocket.zig migrated in lockstep over a day. 203 tests pass. the library side was clean.\n\nthen we deployed the relay.\n\n## the relay migration\n\n[zlay](https://tangled.org/zzstoatzz.io/zlay) is an AT Protocol relay — ~8,400 lines of zig, ~2,750 PDS subscribers, WebSocket fan-out to downstream consumers. it was the heaviest consumer of the 0.15 API surface. migrating it to 0.16 compiled on the first try.\n\nnine crashes followed.\n\n## crash 1: SIGSEGV on startup\n\nthe build auto-selected `Io.Evented` (io_uring available on linux). the relay printed \"io backend: Evented\" and immediately segfaulted. exit code 139.\n\nthe cause: zlay's frame processing pool spawns plain `std.Thread` workers. 
under Evented, any `Io.Mutex` operation calls into the Uring scheduler, which accesses `Thread.current()` — a threadlocal that's only initialized on Uring-managed fibers. plain threads have it set to `null`. in ReleaseFast, `self.?` on null gives a NULL pointer, not a panic. the mutex dereferences a field at an offset from NULL.\n\nfix: force `Backend = Io.Threaded` in main.zig. we'd come back to Evented later.\n\n## crash 2: pool acquire panic\n\n30-60 seconds into processing, `unreachable` panic in `Io.Event.waitTimeout`. stack trace: `event_log.zig uidForDid → pg pool.zig acquire → Io.zig`.\n\npg.zig (the postgres driver) used `Io.Event` as a connection-available signal. `Io.Event.reset()` has an invariant in the stdlib: it assumes no pending call to `wait`. with 16 frame workers contending for 5 database connections, `set()` wakes all waiters, one calls `reset()`, others hit `unreachable`.\n\nfix (in pg.zig fork): replaced `Io.Event` with a monotonic `u32` futex counter. `release()` increments + `futexWake(1)`. `acquire()` snapshots counter + `futexWaitTimeout()` with snapshot. no `reset()`, no single-waiter constraint. also bumped pool size to 20 (was hardcoded 5).\n\n## crash 3: GPF in websocket write\n\n30-60 seconds in, general protection fault in `memcpy → Writer.zig → websocket client.zig writeFrame`.\n\nthe websocket `Client` had no write serialization. three concurrent writers:\n\n1. `pingLoop` — `writeFrame(.ping, ...)` every 30s\n2. `readLoop` auto-pong — `writeFrame(.pong, ...)` on upstream ping\n3. close path — `writeFrame(.close, ...)` on failure\n\ninterleaved frame headers corrupt the shared TLS writer state. the server-side `Conn` already had a write lock; the client was missing it.\n\nfix (in websocket.zig fork): added `_write_lock: Io.Mutex` to `Client`, acquired around both `writeAll` calls in `writeFrame()`.\n\n## crash 4: use-after-free in ping loop\n\nsame GPF stack trace as crash 3, but after the write lock fix. 
every ~60-90 seconds.\n\n`pingLoop` runs as an `io.concurrent` task and sleeps in 1-second increments. when the connection dies, `readLoop` returns and the defer chain runs: `ping_future.cancel(io)` then `client.deinit()`. but `pingLoop` had `io.sleep(...) catch {}` — swallowing all errors, including `error.Canceled`. so `cancel()` couldn't stop it. `deinit()` freed the client's buffers while `pingLoop` was still running.\n\nfix: `catch {}` → `catch return`. makes the ping loop cancellation-cooperative. added `isClosed()` guard before `writeFrame` as defense-in-depth.\n\n## fix 5: HTTP fallback for health probes\n\nnot a crash — a deployment failure. k8s health probes on port 3000 got 400 responses. the websocket server's handshake handler sent 400 on any non-upgrade HTTP request.\n\nfix (in websocket.zig fork): intercept `MissingHeaders`, `InvalidConnection`, `InvalidUpgrade` errors from the handshake parser. for these, re-parse as plain HTTP and dispatch to `Handler.httpFallback()` if it exists (comptime check). k8s probes hit `/_health`, get a 200.\n\n## crash 6: SIGSEGV in the resyncer\n\nafter switching back to `Io.Evented`: SIGSEGV at startup. `dmesg` showed crash addresses in `Uring.zig` at `Thread` struct field offsets from NULL — the same signature as crash 1 but in a different code path.\n\n`addr2line` traced it to the resyncer thread. it was spawned via `io.concurrent()`, which under Evented creates a fiber. but the resync work called `DiskPersist` methods that lock `self.mutex` with `pool_io` (Threaded). Threaded futex from an Evented fiber → NULL `Thread.current()` → SIGSEGV.\n\nthis was the moment the pattern became clear: **you cannot call Threaded Io primitives from Evented fibers, or vice versa.** the futex dispatch goes through the scheduler, and the scheduler has thread-local state that only exists in its own managed execution context.\n\nfix: run the resyncer on a plain `std.Thread` with `pool_io`. 
the thread checks a `shutdown_flag` atomic to exit.\n\n## fix 7: startup connection storm\n\n~2,750 simultaneous WebSocket connects at startup starved the io_uring submission queue. event loop couldn't process completions fast enough.\n\nfix: throttle startup — connect in batches, give the ring time to drain between waves.\n\n## crash 8: cross-Io heap corruption\n\nthe relay ran for hours, then SIGSEGV. zero downstream consumers connected. `dmesg` showed crash addresses in `Uring.zig` at `Thread` struct field offsets from NULL — same signature as crashes 1 and 6, but in steady-state operation.\n\ntwo cross-Io violations were active:\n\n1. **GC loop** — ran as an Evented fiber (`io.concurrent(gcLoop, ...)`), but called `dp.gc()` which locks a mutex with `pool_io` (Threaded) and queries postgres through `pg.Pool` (also Threaded). Threaded futex from Evented fiber → NULL deref.\n\n2. **health check endpoints** — `/_readyz`, `/_health`, `/xrpc/_health` on the metrics and API servers. executed `db.exec(\"SELECT 1\")` through the Threaded `pg.Pool` from an Evented HTTP handler context. same violation.\n\nfix: GC loop moved from `io.concurrent()` to `std.Thread.spawn()` with `pool_io`. health checks replaced with an atomic `last_db_success` timestamp — Threaded workers set it after successful queries, Evented handlers read it. no cross-Io boundary.\n\n## the cross-Io rule\n\nthe central discovery from crashes 1, 6, and 8:\n\n**`Io.Mutex`, `Io.Condition`, `io.sleep()`, and any library that uses them internally (pg.Pool, etc.) must be called from the same Io backend they were initialized with.**\n\nthe mechanism: these primitives dispatch through the Io backend's scheduler via futex. each backend has thread-local state — `Thread.current()` under Uring is a `threadlocal var self: ?*Thread = null`, only set inside `Uring.Thread.run()`. 
calling from outside that context dereferences NULL.\n\n```\nEvented fiber → Io.Mutex.lock(pool_io) → Threaded futex\n    → Thread.current() → threadlocal is NULL\n    → field access at offset from NULL → SIGSEGV\n```\n\nthis isn't documented in the stdlib. the API compiles and type-checks — `Io.Mutex.lock` takes any `Io`. the crash only manifests at runtime when the calling thread's execution context doesn't match the Io's backend.\n\n**safe cross-Io patterns:**\n- raw atomics (`std.atomic.Value`, `fetchAdd`, CAS)\n- `Io.Mutex.tryLock()` — non-blocking CAS, no futex\n- MPSC ring buffers with atomic spinlocks\n- atomic timestamps for health checks\n\n**unsafe cross-Io patterns:**\n- `Io.Mutex.lock()` / `lockUncancelable()` with wrong Io\n- `Io.Condition.wait()` / `signal()` / `broadcast()`\n- `io.sleep()` from wrong context\n- any library that internally uses the above (pg.Pool, etc.)\n\n## the fix: DbRequestQueue\n\n~40 call sites across the relay needed database access from Evented fibers, but `pg.Pool` requires Threaded. the initial approach — a second pool on Evented Io — failed because `netLookup` is unimplemented in Uring. 
three deploy attempts, three rollbacks.\n\nthe solution: an MPSC ring buffer with typed request structs.\n\n```zig\npub const DbRequest = struct {\n    callback: *const fn(*DbRequest, *DiskPersist) void,\n    done: std.atomic.Value(bool) = .{ .raw = false },\n    err: ?anyerror = null,\n\n    pub fn wait(self: *DbRequest) void {\n        while (!self.done.load(.acquire)) {\n            std.atomic.spinLoopHint();\n        }\n    }\n};\n```\n\ncallers define typed structs that embed `DbRequest` and use `@fieldParentPtr`:\n\n```zig\nconst ListActiveHostsReq = struct {\n    base: DbRequest = .{ .callback = &execute },\n    allocator: Allocator,\n    result: ?[]Host = null,\n\n    fn execute(b: *DbRequest, dp: *DiskPersist) void {\n        const self: *@This() = @fieldParentPtr(\"base\", b);\n        self.result = dp.listActiveHosts(self.allocator) catch |e| {\n            b.err = e;\n            return;\n        };\n    }\n};\n```\n\nthe queue itself: 4096 slots, CAS-based spinlock for producers (Evented fibers), 2 worker threads on `pool_io` (Threaded). workers call `req.callback(req, persist)` then `req.done.store(true, .release)`. fibers spin on `done` with `spinLoopHint()`. shutdown drain marks unprocessed requests as done with `error.ShuttingDown`.\n\nno futex. no cross-Io boundary. 
the queue is pure atomics — safe from any execution context.\n\nthe final architecture:\n\n```\nEvented fibers              atomic boundary              Threaded workers\n─────────────────────────── ─────────────────── ──────────────────────────\nPDS subscribers                                 DbRequestQueue (2 workers)\ndownstream consumers         DbRequest.push()      → pg.Pool queries\nbroadcast loop               ──────────────→        → DiskPersist writes\nAPI/admin handlers                                  → host ops\n                              atomic timestamp\nhealth checks ←──────────── last_db_success ←── set by workers after query\n\n                             std.Thread.spawn()\n                                                GC loop (pool_io)\n                                                resyncer (pool_io)\n                                                backfiller (pool_io)\n```\n\n## crash 9: the ghost in the fiber\n\nafter fixing crashes 1–8, the relay ran on Evented with ReleaseFast. (ReleaseSafe had a separate problem — more on that below.) it stayed up for hours at a time, processing the full AT Protocol firehose across ~2,800 PDS connections. then, every 30–90 minutes: SIGSEGV. exit code 139. no stack trace — ReleaseFast strips safety checks.\n\nthe logs showed nothing unusual. chain breaks (expected after restarts when cursor positions are stale), normal reconnection cycles, then sudden death. 13 restarts in 12 hours.\n\nwe had a separate observation that was shaping our thinking: a minimal repro ([`repro_evented.zig`](https://tangled.org/zzstoatzz.io/zlay/blob/main/scripts/repro_evented.zig)) that spawns a single fiber, which returns without yielding, and GPFs immediately under ReleaseSafe. the crash lands in `fiber.zig:contextSwitch` → `Uring.zig:mainIdle`. so we had a confirmed fiber context-switch bug under one build mode, and a mystery SIGSEGV under another. the natural conclusion: same bug, different manifestation. 
ReleaseFast just hides it longer because the optimizer arranges code differently.\n\nwe spent time investigating the fiber machinery, reading disassembly of the context switch, checking for upstream fixes (fiber.zig was unchanged across 32 dev builds). we considered patching the context switch ourselves. we checked upstream Uring networking implementation status — still fully stubbed. we read the zig team's position on Evented — \"experimental,\" \"important followup work to be done.\"\n\nwe concluded the fiber machinery was broken and reverted to `Io.Threaded`. thread-per-PDS, ~2,800 threads, same as 0.15 but on the 0.16 API. the relay stopped crashing.\n\nthen we switched the build back to ReleaseSafe.\n\n```\nthread 543 panic: start index 1370 is larger than end index 1369\nwebsocket.zig/src/client/client.zig:766\n```\n\nit was never the fibers.\n\nthe websocket client's HTTP handshake reader parses response headers line by line. when it finds a `\\r`, it advances `line_start` past the `\\r\\n` to the next line. but TCP can deliver the `\\r` at the end of one read and the `\\n` at the start of the next. when that happens, `line_start` overshoots `pos`, and the next `buf[line_start..pos]` slice has start > end. under ReleaseSafe, that's a bounds-check panic with a stack trace. under ReleaseFast, there's no bounds check — it indexes into garbage memory and eventually corrupts something downstream.\n\nwith ~2,800 connections doing TLS handshakes, the probability of a TCP split landing on the exact `\\r\\n` boundary is low per-connection but high in aggregate. once every 30–90 minutes, some PDS reconnection handshake hits it.\n\nthe fix was one line:\n\n```zig\nline_start = line_end + 2;\nif (line_start > pos) break;  // ← TCP split mid-CRLF, read more\n```\n\nthis is the bug that ReleaseSafe would have caught on the first occurrence, with a stack trace pointing directly at the line. 
instead, we ran ReleaseFast for days, saw silent SIGSEGVs, and blamed the fiber scheduler.\n\n## the ReleaseSafe problem\n\nso why were we on ReleaseFast in the first place?\n\nbecause Evented + ReleaseSafe GPFs on startup. the minimal repro — a fiber that returns without yielding — crashes deterministically in `fiber.zig:contextSwitch`. Debug, ReleaseFast, and ReleaseSmall all pass. only ReleaseSafe triggers it. this reproduces on completely unpatched zig (our Uring networking patch is not involved).\n\ncomparing the disassembly of `Uring.idle` between modes, the difference is in how the SwitchMessage address reaches the inline asm:\n\n```\n// fiber.zig contextSwitch, x86_64:\nasm volatile (\n    \\\\ movq 0(%%rsi), %%rax    // rax = Switch.old\n    \\\\ movq 8(%%rsi), %%rcx    // rcx = Switch.new\n    ...\n    : [message_to_send] \"{rsi}\" (s),    // input: s must be in %rsi\n```\n\nunder ReleaseFast, there's a `lea` that loads the SwitchMessage stack address into `%rsi` before the asm. under ReleaseSafe, that `lea` appears to be missing — `%rsi` holds a stale value from a prior function call. the ReleaseSafe prologue adds stack probing (`__zig_probe_stack`) and a canary (`fs:0x28`), which change the code layout surrounding the inline asm. we think this is why the register allocation differs, but we're not certain — there may be something else going on.\n\nwe've written this up as a [bug report](https://tangled.org/zzstoatzz.io/zlay/blob/main/scripts/fiber_gpf_issue.md) with a standalone reproduction.\n\nthis is a real problem for the ecosystem. ReleaseSafe is the mode designed for production services that want optimization with safety checks. TigerBeetle uses it. the zig compiler's own nightlies recently switched to it. Ghostty and Bun use ReleaseFast, but both have noted they'd prefer ReleaseSafe if the performance cost were lower. 
for `Io.Evented` to be a viable production backend, it needs to work with ReleaseSafe.\n\n## what this means for zat\n\nthe library held up. CBOR, CAR, commit parsing, verification, multibase — all chain correctly through the Io migration. the API change was mechanical: add `io` as first parameter, thread it through.\n\none bug surfaced at the relay level: `tooBig` omission from passthrough frames. the lexicon requires the field on `#commit` events. some PDSes omit it (it's deprecated, always false). zlay's passthrough re-encoding preserved the omission. downstream consumers with strict deserialization (no `#[serde(default)]`) rejected the frames. fix: inject `tooBig: false` when missing during resequencing.\n\nthe streaming client redesign — `subscribe(handler)` instead of `connect()` + `next()` — was the right call. the handler pattern gives the library control over reconnection, backoff, and host rotation. the caller implements `onEvent` and gets reliable delivery without managing connection lifecycle.\n\nsix patches were needed against the zig stdlib or its Uring backend for zlay to run on Evented. `netLookup` is still unimplemented. the cross-Io hazard is still undocumented. but the Io abstraction itself — write once, pick your scheduler — delivered on its promise. the relay runs on Evented with ReleaseFast in production, processes the full firehose at ~2,800 PDS connections, and has been stable since the websocket fix.\n\nthe biggest lesson wasn't technical. we had a confirmed bug in the fiber context switch (the ReleaseSafe GPF) and a mystery crash in production (the SIGSEGV under ReleaseFast). we assumed they were the same bug because the symptoms overlapped — both were crashes in the Evented code path. we spent time investigating fiber machinery, reading disassembly, checking upstream. the actual bug was a one-line off-by-one in a dependency, in a function that had nothing to do with fibers.\n\nthe thing that found it was switching to ReleaseSafe. 
not to fix the crash — we'd already reverted to Threaded for that — but because reverting happened to re-enable the build mode that had the safety checks. the bounds check caught the real bug on the first handshake that split on `\\r\\n`.\n\nthere are two bugs here and they're both real. the websocket off-by-one was the production crash. the ReleaseSafe GPF is a separate issue that blocks Evented from running with safety checks. we'd consider [filing the latter upstream](https://tangled.org/zzstoatzz.io/zlay/blob/main/scripts/fiber_gpf_issue.md). in the meantime, ReleaseFast works, and we know what to look for when it doesn't.\n\nzat is v0.3.0-alpha. the Io parameter is the only breaking change.\n",
  "title": "the io migration"
}