Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidwkbl6wutgphoh7bi7dllpgsihgrds3zs6jsnb36exjxeyqavqju",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mojbnbfq6qr2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreickoktzh53bouzo3cejvvabffpelw2tsmwaiaxkfuspx6qupcnz6m"
    },
    "mimeType": "image/webp",
    "size": 51902
  },
  "path": "/aarnxvvv/why-settimeout-is-lying-to-your-retry-logic-126b",
  "publishedAt": "2026-06-17T21:22:41.000Z",
  "site": "https://dev.to",
  "tags": [
    "node",
    "javascript",
    "webdev",
    "testing",
    "slowdep",
    "github.com/arnnnavvvvv/slowdep"
  ],
  "textContent": "You've written retry logic. It probably looks something like this:\n\n\n\n    async function withRetry(fn, retries = 3) {\n      for (let i = 0; i < retries; i++) {\n        try {\n          return await fn();\n        } catch (err) {\n          if (i === retries - 1) throw err;\n          await new Promise(r => setTimeout(r, 200 * (i + 1)));\n        }\n      }\n    }\n\n\nYou test it locally. You simulate a slow dependency like this:\n\n\n\n    const fakeDB = async () => {\n      await new Promise(r => setTimeout(r, 200)); // simulate DB\n      return { id: 1, name: 'test' };\n    };\n\n\nYour retry logic works. Tests pass. You ship it.\n\nThen in production, your app starts dropping requests under load.\n\n**The problem isn't your retry logic. It's your fake.**\n\n##  Real dependencies don't have flat latency\n\nHere's what a Postgres instance might look like in production (illustrative numbers):\n\n  * **p50: 5ms** — half of all queries finish in under 5ms\n  * **p95: 50ms** — 95% finish under 50ms\n  * **p99: 200ms** — 99% finish under 200ms\n  * **p99.9: 2000ms** — that one unlucky query during a GC pause\n\n\n\nYour `setTimeout(fn, 200)` simulates a p99-slow query on every single call. It never produces the 5ms common case or the 2000ms outlier. That's not how production works. And because it's not how production works, your retry logic has never actually been tested against reality.\n\nThe bugs hide in the variance — not in the slow case, but in the unpredictability.\n\n##  What the real distribution looks like\n\nLatency in distributed systems is often well approximated by a **lognormal distribution**. 
It's right-skewed: most requests are fast, a meaningful minority are slow, and a small tail is very slow.\n\nThis shape comes from how real systems work:\n\n  * **GC pauses** — the garbage collectors in Java, Go, and even Node occasionally stop the world\n  * **Cold caches** — the first query after a cache miss is slower\n  * **Network jitter** — packet routing isn't deterministic\n  * **Noisy neighbors** — other workloads on the same hardware compete for resources\n  * **Connection pool exhaustion** — when all connections are busy, new queries wait\n\n\n\nNone of these are constant. They're random, rare, and multiplicative: slowdowns compound rather than add, and a product of many independent random factors tends toward a lognormal shape.\n\n##  Why this matters for retry logic specifically\n\nConsider this scenario: your p99 latency is 200ms and your timeout is 250ms.\n\nWith `setTimeout(fn, 200)`, every test call takes about 200ms — safely under your timeout. Tests pass.\n\nIn production, 1% of calls exceed 200ms by definition, and the lognormal tail means some of those take 500ms or more. When one does, your 250ms timeout fires, your retry triggers, and now you're sending the same request again to an already-stressed database. Under load, this cascades.\n\nThis is the exact failure mode that causes **retry storms** — and it only appears in production because your local tests used flat delays.\n\nThe bugs that flat delays hide:\n\n  * Timeouts that are too tight for the real p99\n  * Retry logic that amplifies load instead of handling it gracefully\n  * Circuit breakers that never open during tests but open constantly in production\n  * Backoff strategies that feel correct locally but collapse under real variance\n\n\n\n##  The fix: simulate real latency distributions\n\nInstead of a flat delay, fit a lognormal distribution to real p50/p99 values and sample from it. Every call gets a different delay — most are fast, some are slow, a few are very slow. 
Just like production.\n\nHere's the math:\n\n\n\n    function fitLognormal(p50, p99) {\n      // p50 = median = e^mu  →  mu = ln(p50)\n      // p99 = e^(mu + 2.326*sigma), where 2.326 is the standard normal z-score for the 99th percentile\n      const mu = Math.log(p50);\n      const sigma = (Math.log(p99) - mu) / 2.326;\n      return { mu, sigma };\n    }\n\n    function sampleLatency(p50, p99) {\n      const { mu, sigma } = fitLognormal(p50, p99);\n      // Box-Muller transform: two uniform samples become one standard normal sample\n      // Math.random() returns [0, 1), so use 1 - Math.random() to keep u1 in (0, 1] and avoid Math.log(0)\n      const u1 = 1 - Math.random();\n      const u2 = Math.random();\n      const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);\n      // exponentiating a normal sample gives a lognormal latency in ms\n      return Math.exp(mu + sigma * z);\n    }\n\n\nCall `sampleLatency(5, 200)` ten times and, after rounding, you'll get something like:\n\n\n\n    3ms, 7ms, 2ms, 12ms, 4ms, 180ms, 6ms, 3ms, 9ms, 440ms\n\n\nThat's what your database actually looks like.\n\n##  Using slowdep\n\nI built slowdep to make this a one-liner. It wraps any async function with a lognormal latency profile — either a built-in preset or your own p50/p99 values.\n\n\n\n    npm install slowdep\n\n\n\n    import { withLatency } from 'slowdep';\n\n    // before: flat fake\n    const flatFakeDB = async (id) => ({ id, name: 'test' });\n\n    // after: realistic latency\n    const fakeDB = withLatency(async (id) => ({ id, name: 'test' }), 'postgres');\n\n\nNow run your retry logic against it:\n\n\n\n    const result = await withRetry(() => fakeDB(42));\n\n\nYou'll immediately see things you didn't see before:\n\n  * Some retries succeed on the second attempt (realistic)\n  * Occasional calls hit your timeout (revealing tight timeouts)\n  * Rare calls cascade into all retries failing (revealing missing backoff jitter)\n\n\n\nBuilt-in presets cover the most common dependencies:\n\nPreset | p50 | p99 | Error rate\n---|---|---|---\n`'postgres'` | 5ms | 200ms | 0.1%\n`'redis'` | 1ms | 20ms | 0.05%\n`'stripe'` | 200ms | 2000ms | 0.2%\n`'openai'` | 800ms | 8000ms | 0.5%\n`'s3'` | 30ms | 500ms | 0.1%\n\nYou can also pass custom profiles:\n\n\n\n    const slowFetch = withLatency(fetchAPI, {\n  
    p50: 100,\n      p99: 3000,\n      errorRate: 0.02, // 2% transient errors\n    });\n\n\n##  The real test\n\nHere's what testing retry logic actually looks like with realistic latency:\n\n\n\n    import { withLatency } from 'slowdep';\n\n    // realistic postgres simulation\n    const db = withLatency(async (id) => {\n      return { id, name: 'Arnav' };\n    }, 'postgres');\n\n    // your retry logic\n    async function withRetry(fn, retries = 3, baseDelay = 100) {\n      for (let i = 0; i < retries; i++) {\n        try {\n          return await fn();\n        } catch (err) {\n          if (i === retries - 1) throw err;\n          // exponential backoff with jitter\n          const delay = baseDelay * Math.pow(2, i) * (0.5 + Math.random() * 0.5);\n          await new Promise(r => setTimeout(r, delay));\n        }\n      }\n    }\n\n    // now you're actually testing against production-like behavior\n    // db is the wrapped function itself, so call it directly\n    const result = await withRetry(() => db(42));\n\n\nRun this a hundred times. Watch which calls fail. Tune your timeouts and backoff based on what you see. That's actual resilience testing — not false confidence from a flat 200ms.\n\n##  Summary\n\n  * Real dependency latency is lognormal: fast most of the time, occasionally slow, rarely very slow\n  * `setTimeout(fn, 200)` tests only one slow case, every time, so it hides the bugs that only appear under variance\n  * Fitting a lognormal distribution to your p50/p99 values gives you realistic simulation in one function call\n  * slowdep has zero dependencies, wraps any async function, and ships built-in presets for postgres, redis, stripe, openai, s3, and more\n\n\n\nIf your retry logic has never been tested against real latency variance, it probably has bugs you haven't found yet.\n\n_Source code and presets: github.com/arnnnavvvvv/slowdep_",
  "title": "Why setTimeout is Lying to Your Retry Logic"
}