Why setTimeout is Lying to Your Retry Logic

DEV Community [Unofficial] June 17, 2026

You've written retry logic. It probably looks something like this:

async function withRetry(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === retries - 1) throw err;
      await new Promise(r => setTimeout(r, 200 * (i + 1)));
    }
  }
}

You test it locally. You simulate a slow dependency like this:

const fakeDB = async () => {
  await new Promise(r => setTimeout(r, 200)); // simulate DB
  return { id: 1, name: 'test' };
};

Your retry logic works. Tests pass. You ship it.

Then in production, your app starts dropping requests under load.

The problem isn't your retry logic. It's your fake.

Real dependencies don't have flat latency

Here's what your Postgres instance actually looks like in production:

  • p50: 5ms — half of all queries finish in under 5ms
  • p95: 50ms — 95% finish under 50ms
  • p99: 200ms — 99% finish under 200ms
  • p99.9: 2000ms — that one unlucky query during a GC pause

Your setTimeout(r, 200) simulates a p99-slow query on every single call. That's not how production works, and because it's not how production works, your retry logic has never actually been tested against reality.

The bugs hide in the variance — not in the slow case, but in the unpredictability.

What the real distribution looks like

Latency in distributed systems is often well approximated by a lognormal distribution. It's right-skewed: most requests are fast, a meaningful minority are slow, and a small tail is very slow.

This shape comes from how real systems work:

  • GC pauses — Java, Go, and even Node's garbage collector occasionally stops the world
  • Cold caches — first query after a cache miss is always slower
  • Network jitter — packet routing isn't deterministic
  • Noisy neighbors — other workloads on the same hardware compete for resources
  • Connection pool exhaustion — when all connections are busy, new queries wait

None of these are constant. They're random, rare, and multiplicative — which is exactly what produces a lognormal shape.
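If you want to see the multiplicative effect for yourself, here's a rough simulation. It's only a sketch: the base latency, the number of slowdown sources, and the 0.5x to 2x factor range are made-up assumptions, and simulateLatencies and summarize are names invented for this example.

```javascript
// Illustrative only: the base latency, number of slowdown sources, and
// factor range below are made-up assumptions, not measurements.
function simulateLatencies(n, rand = Math.random) {
  const baseMs = 5;   // assumed best-case latency
  const sources = 8;  // assumed number of independent slowdown sources
  const samples = [];
  for (let i = 0; i < n; i++) {
    let latency = baseMs;
    for (let k = 0; k < sources; k++) {
      latency *= 0.5 + 1.5 * rand(); // each source scales latency by 0.5x to 2x
    }
    samples.push(latency);
  }
  return samples;
}

// Mean plus nearest-rank p50 and p99 of a list of samples.
function summarize(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  const at = (p) => sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  return { mean, p50: at(0.5), p99: at(0.99) };
}

const skewStats = summarize(simulateLatencies(20000));
console.log(skewStats); // mean lands above p50, and p99 sits far above both
```

Multiplying factors means adding their logarithms, and a sum of many independent terms tends toward a normal shape. That's why the product ends up looking roughly lognormal, with the mean dragged above the median by the tail.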

Why this matters for retry logic specifically

Consider this scenario: your p99 latency is 200ms and your timeout is 250ms.

With setTimeout(fn, 200), every test call takes exactly 200ms — safely under your timeout. Tests pass.

In production, the lognormal tail means 0.1% of calls take 500ms or more. Your 250ms timeout fires, your retry triggers, and now you're sending the same request again to an already-stressed database. Under load, this cascades.

This is the exact failure mode that causes retry storms — and it only appears in production because your local tests used flat delays.
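You can put a number on that scenario. The sketch below assumes the p50 of 5ms from the Postgres figures earlier alongside the 200ms p99, fits the same two-point lognormal, and asks what fraction of calls would run past a given threshold. fractionSlowerThan is a name made up for this example, and erf uses a standard polynomial approximation because JavaScript has no built-in one.

```javascript
// erf via Abramowitz and Stegun formula 7.1.26 (absolute error below 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal CDF.
const normalCdf = (z) => 0.5 * (1 + erf(z / Math.SQRT2));

// Fraction of calls slower than thresholdMs for a lognormal fitted to p50/p99.
function fractionSlowerThan(thresholdMs, p50, p99) {
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326; // 2.326 = z-score of the 99th percentile
  return 1 - normalCdf((Math.log(thresholdMs) - mu) / sigma);
}

// Assumed profile: p50 = 5ms, p99 = 200ms, checked against a 250ms timeout.
const overTimeout = fractionSlowerThan(250, 5, 200);
console.log((overTimeout * 100).toFixed(2) + '% of calls exceed 250ms');
```

Under these assumptions a bit under 1% of calls would trip a 250ms timeout even when the database is healthy. A flat 200ms fake trips it 0% of the time.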

The bugs that flat delays hide:

  • Timeouts that are too tight for the real p99
  • Retry logic that amplifies load instead of handling it gracefully
  • Circuit breakers that never open during tests but open constantly in production
  • Backoff strategies that feel correct locally but collapse under real variance
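One thing worth spelling out: the withRetry helper at the top of this post only retries when fn throws, so something has to turn a slow call into an error. Here's a minimal sketch of that piece. withTimeout is a hypothetical helper written for this example, and slowCall and fastCall are stand-ins so the block runs on its own.

```javascript
// Hypothetical helper: rejects if fn() has not settled within ms milliseconds.
// It does not cancel the underlying work; it only stops waiting for it.
function withTimeout(fn, ms) {
  return (...args) =>
    new Promise((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`timed out after ${ms}ms`)),
        ms
      );
      Promise.resolve()
        .then(() => fn(...args))
        .then(resolve, reject)
        .finally(() => clearTimeout(timer));
    });
}

// Stand-in dependencies for demonstration.
const slowCall = () => new Promise((r) => setTimeout(() => r('late'), 50));
const fastCall = async () => 'ok';

const guardedSlow = withTimeout(slowCall, 10); // rejects: 50ms is over the 10ms limit
const guardedFast = withTimeout(fastCall, 10); // resolves with 'ok'
```

Note that the timed-out call keeps running on the dependency's side. That's part of why retries after a timeout add load rather than replace it.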

The fix: simulate real latency distributions

Instead of a flat delay, fit a lognormal distribution to real p50/p99 values and sample from it. Every call gets a different delay — most are fast, some are slow, a few are very slow. Just like production.

Here's the math:

function fitLognormal(p50, p99) {
  // p50 = median = e^mu  →  mu = ln(p50)
  // p99 = e^(mu + 2.326*sigma), where 2.326 is the standard normal
  // z-score for the 99th percentile
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326;
  return { mu, sigma };
}

function sampleLatency(p50, p99) {
  const { mu, sigma } = fitLognormal(p50, p99);
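  // Box-Muller transform. Math.random() can return 0, and Math.log(0) is
  // -Infinity, so use 1 - Math.random() to keep u1 in (0, 1].
  const u1 = 1 - Math.random(), u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);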
  return Math.exp(mu + sigma * z);
}

Call sampleLatency(5, 200) ten times, round the results, and you'll get something like:

3ms, 7ms, 2ms, 12ms, 4ms, 180ms, 6ms, 3ms, 9ms, 440ms

That's what your database actually looks like.
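A quick sanity check is to draw a lot of samples and confirm the empirical percentiles land near the targets you fitted. This sketch repeats the two functions from above so it runs on its own; percentile is a small helper added for the example.

```javascript
function fitLognormal(p50, p99) {
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326;
  return { mu, sigma };
}

function sampleLatency(p50, p99) {
  const { mu, sigma } = fitLognormal(p50, p99);
  const u1 = 1 - Math.random(); // in (0, 1], keeps Math.log finite
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}

// Empirical percentile from an ascending-sorted array (nearest-rank style).
function percentile(sorted, p) {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

const samples = Array.from({ length: 200000 }, () => sampleLatency(5, 200));
samples.sort((a, b) => a - b);

const checkedP50 = percentile(samples, 0.5);
const checkedP99 = percentile(samples, 0.99);
console.log({ checkedP50, checkedP99 }); // should land near 5 and 200
```

A two-point fit only pins down p50 and p99. Anything further out, like p99.9, is whatever the lognormal shape implies, which may or may not match your real tail.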

Using slowdep

I built slowdep to make this a one-liner. It wraps any async function with a lognormal latency profile — either a built-in preset or your own p50/p99 values.

npm install slowdep



import { withLatency } from 'slowdep';

// before: fake with no latency profile
const plainFakeDB = async (id) => ({ id, name: 'test' });

// after: realistic latency
const fakeDB = withLatency(async (id) => ({ id, name: 'test' }), 'postgres');

Now run your retry logic against it:

const result = await withRetry(() => fakeDB(42));

You'll immediately see things you didn't see before:

  • Some retries succeed on the second attempt (realistic)
  • Occasional calls hit your timeout (revealing tight timeouts)
  • Rare calls cascade into all retries failing (revealing missing backoff jitter)

Built-in presets cover the most common dependencies:

Preset        p50      p99       Error rate
'postgres'    5ms      200ms     0.1%
'redis'       1ms      20ms      0.05%
'stripe'      200ms    2000ms    0.2%
'openai'      800ms    8000ms    0.5%
's3'          30ms     500ms     0.1%

You can also pass custom profiles:

const slowFetch = withLatency(fetchAPI, {
  p50: 100,
  p99: 3000,
  errorRate: 0.02, // 2% transient errors
});
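If you're curious what a wrapper with this kind of profile might do under the hood, here's a rough sketch. To be clear, this is not slowdep's code: withSimulatedLatency is a hypothetical stand-in that assumes the same p50, p99, and errorRate fields, samples a delay with the math from earlier, and fails a fraction of calls.

```javascript
// Not slowdep's code: a hypothetical stand-in showing the general idea.
function withSimulatedLatency(fn, { p50, p99, errorRate = 0 }, rand = Math.random) {
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326;

  return async (...args) => {
    // Sample a lognormal delay via Box-Muller (1 - rand() avoids log(0)).
    const z = Math.sqrt(-2 * Math.log(1 - rand())) * Math.cos(2 * Math.PI * rand());
    const delayMs = Math.exp(mu + sigma * z);
    await new Promise((resolve) => setTimeout(resolve, delayMs));

    // Fail a fraction of calls to mimic transient errors.
    if (rand() < errorRate) {
      throw new Error('simulated transient error');
    }
    return fn(...args);
  };
}

// Small p50/p99 values here just keep the demo quick.
const sketchFetch = withSimulatedLatency(async (id) => ({ id }), {
  p50: 1,
  p99: 5,
  errorRate: 0.02,
});
```

The delay is applied before the error check in this sketch, so failed calls are slow too. Whether errors should be fast or slow is a design choice worth matching to the dependency you're imitating.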

The real test

Here's what testing retry logic actually looks like with realistic latency:

import { withLatency } from 'slowdep';

// realistic postgres simulation
const db = withLatency(async (id) => {
  return { id, name: 'Arnav' };
}, 'postgres');

// your retry logic
async function withRetry(fn, retries = 3, baseDelay = 100) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === retries - 1) throw err;
      // exponential backoff with jitter
      const delay = baseDelay * Math.pow(2, i) * (0.5 + Math.random() * 0.5);
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

// now you're actually testing against production-like behavior
// db is the wrapped function itself, so call it directly
const result = await withRetry(() => db(42));

Run this a hundred times. Watch which calls fail. Tune your timeouts and backoff based on what you see. That's actual resilience testing — not false confidence from a flat 200ms.
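Running it a hundred times by hand gets old fast, so a tiny harness helps. This is a sketch only: tallyOutcomes is a hypothetical helper, and flakyOperation is a stand-in that fails every fourth call so the example runs without any dependency.

```javascript
// Hypothetical harness: runs an async operation `runs` times concurrently
// and tallies how many calls succeeded, plus a count per error message.
async function tallyOutcomes(operation, runs = 100) {
  const results = await Promise.allSettled(
    Array.from({ length: runs }, () => Promise.resolve().then(operation))
  );
  const tally = { ok: 0, failed: 0, errors: {} };
  for (const r of results) {
    if (r.status === 'fulfilled') {
      tally.ok++;
    } else {
      tally.failed++;
      const msg = r.reason instanceof Error ? r.reason.message : String(r.reason);
      tally.errors[msg] = (tally.errors[msg] || 0) + 1;
    }
  }
  return tally;
}

// Stand-in operation for demonstration: every 4th call fails.
let callCount = 0;
const flakyOperation = async () => {
  callCount++;
  if (callCount % 4 === 0) throw new Error('simulated failure');
  return 'ok';
};

tallyOutcomes(flakyOperation, 100).then((tally) => console.log(tally));
```

With the code from this post you'd pass something like () => withRetry(() => db(42)) as the operation. The calls here start concurrently, which is closer to load than a sequential loop; run them one at a time instead if you want to isolate per-call behavior.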

Summary

  • Real dependency latency is lognormal: fast most of the time, occasionally slow, rarely very slow
  • A flat setTimeout(r, 200) tests one slow case over and over; it hides the bugs that only appear with variance
  • Fitting a lognormal distribution to your p50/p99 values gives you realistic simulation in one function call
  • slowdep has zero dependencies, wraps any async function, and ships built-in presets for postgres, redis, stripe, openai, s3, and more

If your retry logic has never been tested against real latency variance, it probably has bugs you haven't found yet.

Source code and presets: github.com/arnnnavvvvv/slowdep
