Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibjmol672zlfpljkazr4qu5kvzxpkmcqk6mwshxfr76ltzhpyii74",
    "uri": "at://did:plc:zxocse6iz2rn5sxn3l6aqlk6/app.bsky.feed.post/3mmqe4lb5tvb2"
  },
  "path": "/2026/05/vibe-bottlenecks",
  "publishedAt": "2026-05-26T06:22:30.508Z",
  "site": "https://macbirdie.net",
  "tags": [
    "CRDTs",
    "☉"
  ],
  "textContent": "_Three backends, three languages, and what the benchmarks taught me about my own assumptions._\n\nI’ve been building _Slips_, a collaborative, real-time task list app that is part of a series of lightweight, self-hostable collaboration tools, with one constraint: the backend would be written primarily by an LLM, with me in the role of technical director rather than primary author. The result was three functionally identical server implementations: Node.js, Go, and Swift. Same API surface, same SQLite persistence, same WebSocket sync protocol. For the client apps, each backend must be essentially a drop-in replacement.\n\nThe experiment started as a productivity exercise. It turned into something more interesting: a systematic audit of where LLMs may fall short, what “idiomatic” code actually costs at runtime, and some wrong assumptions I had about language and runtime performance.\n\n* * *\n\n## The Setup\n\n_Slips_ is a collaborative task list with real-time sync to multiple clients over WebSockets. The HTTP API covers the usual operations: create a list, fetch it by share token, manage tasks. The list ID is derived via SHA-256 from the share token, a random base64 string, so every request that looks up a list performs a hash computation. The WebSocket layer handles live sync with the Automerge library, which is built on CRDTs.\n\nAll three backends use SQLite as the persistence layer (WAL mode, which I’ll come back to), run on the same machine (Apple M1, macOS), and were benchmarked with a Go-based tool running 200 operations at 10 concurrent workers.\n\nThe Node.js implementation came first and received the most guidance: detailed prompts, iterative corrections, explicit direction toward correct behavior, and successful manual tests. 
The Go and Swift versions were reimplementations produced with lighter prompting: “here’s what this does, look at that fugly js code, use the target language the way it was meant to be used.”\n\n* * *\n\n## Round 1: Node.js — Getting it Working\n\nThe Node.js backend reached a working state relatively quickly. Express for HTTP, `better-sqlite3` for synchronous SQLite access, `ws` for WebSockets. The LLM’s choices were reasonable and pretty conventional. They matched what I had used in the past: defaults good enough to avoid falling too deep into a rabbit hole of dependencies.\n\nThe first benchmark session revealed numbers that, on the surface, looked acceptable: around 1,400 ops/s on sequential POSTs. Not fast, but not obviously broken either. By the time benchmarking started, all three implementations already existed and were roughly functionally equivalent.\n\n### Web Crypto surprise\n\nThe function that derives a list ID from a share token - called on every single request - was implemented using the Web Crypto API (`crypto.subtle.digest`). This is the modern, standards-compliant way to hash data in JavaScript. It’s also asynchronous, which means every request was performing a thread pool dispatch and a Promise task just to compute a SHA-256 hash.\n\nThe fix was switching to Node’s built-in `crypto` module, specifically `createHash('sha256')` — a synchronous C++ binding that goes directly to OpenSSL with little JavaScript overhead and no Promises. Sequential POST throughput went from ~1,153 to ~2,848 ops/s. **2.5x improvement** from one function call change. Yikes.\n\nThe same file had a secondary issue in token generation: a spread operator feeding into `btoa()` followed by three separate regex passes to clean up the base64 output - what in the holy convoluted batman is this. Replaced with a single `randomBytes(32).toString('base64url')` call. Same result, one native operation. 
I was quite surprised by how convoluted the original implementation was, as if the LLM had just tried things until something worked. Then again, this implementation was the most painful to produce and required the most hand-holding: the requirements never seemed to be grasped in their entirety, even though we had worked out the spec before the first code hit the runtime.\n\n### Logging overhead\n\nAt the default log level, debug output to stderr was consuming a measurable fraction of request time. With `LOG_LEVEL=error`, sequential POST throughput recovered to 2,975 ops/s — matching the historical best. This isn’t a criticism of the LLM; it’s a reminder that benchmarking with debug logging enabled is benchmarking the logger. On the other hand, a real-world deployment requires logging, so benchmaxxing cannot drive such decisions. Maybe the logging layer needs some tweaks of its own.\n\n### SQLite WAL mode\n\nSQLite’s journal mode defaults to a rollback journal that serializes all reads behind active writes. WAL (Write-Ahead Logging) decouples readers and writers by appending changes to a separate file, letting readers operate against a consistent snapshot while a write is in progress. For a workload with 10 concurrent writers, this matters enormously.\n\nThe impact was measured directly on the Go backend (where it was easiest to isolate): WAL mode delivered an approximately **5x improvement** on sequential writes and **5.1x on concurrent writes**. All three backends ended up with WAL enabled, though the mechanism differed — Go and Node.js required an explicit PRAGMA, while Swift’s GRDB library enables it automatically via `DatabasePool`. Why WAL isn’t the default in the first place beats me.\n\n**Final Node.js numbers: 2,975 ops/s sequential POST, ~1,024 ops/s GET by token, 121.8 MB RSS (actual physical memory in use) at idle.**\n\n* * *\n\n## Round 2: Swift — Where My Assumptions Broke\n\nI expected Swift to be the fastest of the three, possibly by a significant margin. 
The reasoning: ARC eliminates garbage collector pauses, Swift compiles to native ARM64 machine code with aggressive optimization, Apple’s frameworks are tuned for Apple silicon. The LLM chose Hummingbird 2 for HTTP (a solid NIO-based framework) and GRDB for SQLite (which handles WAL automatically via `DatabasePool`).\n\nThe first benchmark told a different story.\n\n### The actor hop\n\nThe initial Swift implementation modeled the backend with two actors: an `API` actor handling routing logic, calling into a `Store` actor for persistence. This is a natural way to think about the architecture in Swift’s concurrency model — separate concerns, separate actors, compile-time safety.\n\nThe cost: each actor boundary is a cooperative executor suspension. Two hops per request, each carrying a ~5–15μs overhead. The benchmark result was **139 ops/s** on GET by token. Go was doing over 3,000 on the same test.\n\nThis wasn’t the LLM doing something wrong in any naive sense. The code was correct, it was idiomatic Swift 6, it compiled cleanly with strict concurrency checking. The problem was that “idiomatic” and “performant” diverged significantly at this particular boundary.\n\nThe fix was restructuring to a single actor hop: a `Sendable` class for the API layer (zero executor cost from route handlers) calling into a single `Store` actor for actual persistence. Throughput jumped to ~3,300 ops/s. The actor boundary is still there where it matters — around actual database writes — but the unnecessary intermediate hop is gone.\n\n### 33 allocations for a hex string\n\nThe SHA-256 hash of a share token needs to be hex-encoded on every request. The LLM reached for `hash.map { String(format: \"%02x\", $0) }.joined()` — a common Swift pattern that looks completely innocuous.\n\nIt isn’t. `String(format:)` routes through `CFStringCreateWithFormat`, which means Objective-C bridging. 
For each of the 32 bytes in a SHA-256 hash:\n\n  * The `UInt8` is boxed into an `NSNumber`\n  * `CFStringCreateWithFormat` allocates an autoreleased `CFString`\n  * That bridges back to a Swift `String` with a separate allocation\n  * The resulting string lands in an intermediate array\n\n\n\n33 heap allocations per call, all hitting the Obj-C autorelease pool. At 5,000 requests/second, that’s 165,000 unnecessary allocations per second.\n\nThe replacement: a pure-Swift lookup table mapping nibbles directly to ASCII bytes, writing into a pre-allocated `[UInt8]` buffer, then constructing the final string with a single `String(decoding:as:)` call. Three allocations instead of 33. GET by token throughput improved by **57%**.\n\nThe same Foundation-bridging antipattern appeared in token validation (using `CharacterSet`, which bridges to `NSCharacterSet`) and token generation (using `replacingOccurrences(of:with:)`, which bridges to `NSString`). All three were converted to pure-Swift equivalents.\n\n### Going further: unsafe buffer tricks\n\nAfter the lookup table fix, there were still two unnecessary allocations in the hot path: an intermediate `[UInt8]` buffer for the hex output, and a `Data` copy of the input token for the SHA-256 computation.\n\n`String(unsafeUninitializedCapacity:initializingUTF8With:)` writes hex characters directly into the String’s internal storage, bypassing the intermediate buffer. 
`withContiguousStorageIfAvailable` reads the token’s UTF-8 bytes from its internal storage without a copy (Swift 5+ stores native strings as UTF-8 internally).\n\nThe microbenchmark result across 100,000 iterations on M3:\n\nVersion | Time | Allocations\n---|---|---\nOriginal (`map + String(format:)`) | 36,844 ns/op | 33+\nLookup table with `[UInt8]` buffer | 7,467 ns/op | 3\n`String(unsafeUninitializedCapacity:)` | 6,051 ns/op | 2\n+ no-copy UTF-8 input | **5,617 ns/op** | **1**\n\nAn 85% reduction in that function’s overhead (36,844 to 5,617 ns/op), purely from eliminating Obj-C bridging and redundant copies.\n\n**Final Swift numbers: 5,049 ops/s GET by token, 8,620 ops/s concurrent POSTs, 56.1 MB RSS. Note: Swift’s concurrent throughput varies across runs — the NIO event loop showed elevated CPU usage (up to 576% at idle) in some sessions, which dragged down benchmark scores. Historical best was 14,030 ops/s on concurrent writes.**\n\n* * *\n\n## Round 3: Go — Standing on the Shoulders of stdlib\n\nThe Go backend benefited from a combination of LLM choices that happened to align well with Go’s strengths from the start: `net/http` for HTTP, `mattn/go-sqlite3` (CGO) for SQLite, `gorilla/websocket` for WebSockets. Standard choices, well-trodden path.\n\nThe performance optimizations here were less about correcting structural mistakes and more about pushing a good baseline further.\n\n### WAL mode and CGO\n\nSwitching from `modernc.org/sqlite` (pure-Go, transpiled from SQLite’s C sources) to `mattn/go-sqlite3` (CGO, native SQLite library) improved GET throughput significantly — from ~3,608 to ~6,861 ops/s on token lookups — because the CGO version runs the natively compiled, optimized SQLite C library. Combined with WAL mode, write throughput approximately doubled.\n\n### Per-token shard mutexes\n\nThe original implementation used a single `sync.RWMutex` protecting the entire provider state. 
Under concurrent load, all goroutines writing to different lists were still serializing on this one lock.\n\nThe fix was splitting into 64 shard-level mutexes, keyed by FNV-1a hash of the share token. The provider-level lock is now only held briefly for map lookups; the CPU-heavy work — Automerge operations, cryptography, SQLite writes — runs under only the per-token shard lock. Different tokens can write concurrently.\n\nResults:\n\nMetric | Before | After | Change\n---|---|---|---\nPOST seq (ops/s) | 2,907 | 5,327 | **+83%**\nPOST c=10 (ops/s) | 9,122 | 15,677 | **+72%**\nGET by token (ops/s) | 4,739 | 7,536 | **+59%**\n\n### fmt.Sprintf vs hex.EncodeToString\n\nA small but measurable optimization: the original `DeriveListID` used `fmt.Sprintf(\"%x\", h)` to hex-encode the SHA-256 hash. `fmt.Sprintf` uses reflection internally to format its arguments. `hex.EncodeToString(h[:])` is a direct memory operation with a single allocation for the output string. The difference is small per call, but measurable at throughput.\n\n**Final Go numbers: 6,866 ops/s GET by token, 12,122 ops/s concurrent POSTs, 44.5 MB RSS.**\n\n* * *\n\n## The Final Comparison\n\nTested 2026-05-25 on Apple M1 (arm64), macOS 26.0. 
All three backends in the same session, fresh starts, clean databases.\n\nMetric | Go | Swift | Node.js\n---|---|---|---\nPOST list (seq) ops/s | **4,370** | 3,301 | 2,975\nPOST list P50 latency | **0.20ms** | 0.25ms | 0.26ms\nPOST list (c=10) ops/s | **12,122** | 8,620 | 2,479\nGET by token (seq) ops/s | **6,866** | 5,049 | 1,024\nGET by token P50 latency | **0.13ms** | 0.19ms | 0.69ms\nMemory idle RSS | **44.5 MB** | 56.1 MB * | 121.8 MB\nBinary size | 10 MB | 17 MB | ~350 MB†\n\n*Swift RSS was higher in this session (56.1 MB vs Go’s 44.5 MB) but varies; historical range 27–56 MB.\n†Node.js binary size includes `node_modules`.\n\n* * *\n\n## What I Got Wrong\n\n### “Swift will be fastest”\n\nThe reasoning was: no garbage collector pauses (ARC handles memory), native machine code, Apple hardware. What I underestimated was the cost of Swift’s concurrency model. The actor system is genuinely powerful and its safety guarantees are worth having — but actor boundaries have real runtime cost, and the LLM naturally reached for the most “correct” structure without performance implications in mind.\n\nBeyond concurrency, the Obj-C bridging legacy is a trap for anyone who learned Swift before Swift 5’s clean break. The `String(format:)` pattern is in tutorials, in Apple’s own documentation examples, in thousands of Stack Overflow answers. It’s idiomatic — and it’s expensive in hot paths like this one.\n\nSwift can be extremely fast. Getting there requires knowing which APIs are pure-Swift and which ones drop into Obj-C under the hood. That knowledge doesn’t come from reading documentation - it comes from benchmarking.\n\n### “Node.js will be dramatically slower”\n\nThe final gap on sequential writes is Go at 4,370 vs Node.js at 2,975 — roughly 1.5x, not an order of magnitude. 
On concurrent writes the gap is wider (12,122 vs 2,479), but that’s a fundamental architectural constraint: `better-sqlite3` is synchronous and serializes on the main thread.\n\nNode.js’s strength is that when it can offload to C++, it does so efficiently. The native `crypto` module isn’t a JavaScript wrapper with overhead — it’s OpenSSL with a thin binding layer. The V8 engine has had well over a decade of investment. The runtime isn’t slow; the question is whether you’re doing work in JavaScript or in the C++ layer it sits on top of.\n\nBravo to Node.js for holding its own, despite the icky, slow JavaScript it is said to run underneath.\n\n### “Go is a middle ground”\n\nGo won across the board, often by a significant margin. What I didn’t fully appreciate was how well optimized Go’s standard library is. The SHA-256 implementation uses hand-written assembly and the CPU’s SHA instructions where available for fast hash calculation. The hex encoding uses direct byte operations with no reflection. The HTTP server is production-grade and heavily optimized - it’s Google’s baby after all. Goroutines are lightweight and fast.\n\nThe LLM went with stdlib throughout the Go implementation, and took some impressive wins. The Go team’s religious approach to simplicity, efficiency, and constant performance tuning of both the runtime and the library surely inspires us vegans of the server-side world to strive for the same when using it.\n\n* * *\n\n## Reflections on LLM-Assisted Development\n\nThe most interesting finding isn’t in the benchmark numbers — it’s in the pattern of where the LLMs went wrong.\n\nIn every case, the initial code was _correct_ by conventional standards. The Swift actor chain passed strict concurrency checking. The Web Crypto call used the officially recommended modern API. The `String(format:)` hex encoding is in the Swift documentation. None of these were bugs; they were choices that looked right until you measured them.\n\nLLMs optimize for code that looks right, reads well, and follows documented patterns. 
They’ve been trained on vast amounts of human-written code, which skews heavily toward “working” rather than “optimal.” They don’t profile. They don’t have an intuition for what a particular abstraction costs at runtime.\n\nWhat an experienced developer adds to this loop isn’t more code — the LLM handles that. It’s the mental model that asks _“what is this actually doing at runtime?”_ when something looks clean already. It’s the habit of measuring before assuming. And, as this experiment showed, it’s the willingness to have your prior assumptions proven wrong by the numbers.\n\nThe Node.js version required the most guidance during development. It also revealed the clearest antipattern (Web Crypto async vs native sync) and delivered the cleanest optimization story. The more “autonomous” Go and Swift implementations had more interesting structural problems to untangle.\n\nI’m not sure what conclusion to draw from that, except that “less hand-holding” doesn’t mean “better code” — it means the LLM made its own decisions, and some of those decisions were questionable.\n\nSurely augmenting the system prompt for the coding environment, setting some proper guardrails and directions in agent instruction files, or using a set of advanced skills freely available for each ecosystem would steer the LLM in the right direction much faster, but the experiment was to see what the defaults are. Many of those defaults stem purely from the models’ knowledge cut-off, which is a completely different story, and one that is gradually becoming much more depressing.\n\n* * *\n\n## Numbers Don’t Lie, But They Do Require Context\n\nA few caveats worth noting:\n\n  * These benchmarks run on a single machine, in-process, with no network latency. Real-world throughput for all three would be bottlenecked by I/O long before hitting these numbers.\n  * Swift’s NIO event loop showed anomalous CPU usage (up to 576% at idle) in some sessions. 
The historical best for Swift concurrent writes was 14,030 ops/s — significantly higher than the 8,620 captured in the final comparison session.\n  * Node.js concurrent write throughput (2,479 ops/s) reflects a fundamental architectural choice: `better-sqlite3` is synchronous and serializes on the event loop thread. A different SQLite strategy (WAL + connection pool + worker threads) could change that picture, at the cost of implementation complexity.\n  * Memory numbers for Node.js (121.8 MB) include the V8 heap baseline. For long-running production services this may not matter; for constrained environments it does.\n  * The LLMs used throughout the experiment were a mix of locally hosted Qwen 3.6 27B, OpenCode Go’s DeepSeek V4 Flash, and some Claude Code, so results may vary.\n\n\n\n☉",
  "title": "Chasing the vibe-coded bottlenecks",
  "updatedAt": "2026-05-25T13:38:00.000Z"
}