{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig6nuogzpreefenc5br6dga63u6p4mmuh26zhqdcskt7qnxrqr7yq",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mowozfc3thw2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreihnuklg2gelcho3k4iv7ifsldmybjygislnetxqmk6tjsxhtl7x2y"
},
"mimeType": "image/webp",
"size": 77482
},
"path": "/pheonix_mk_e0ecc0233ababe/building-a-reliable-webhook-delivery-system-what-actually-broke-and-how-i-fixed-it-l74",
"publishedAt": "2026-06-23T05:24:48.000Z",
"site": "https://dev.to",
"tags": [
"api",
"backend",
"python",
"systemdesign"
],
"textContent": "Webhooks seem simple until a worker crashes mid-delivery, a subscriber goes down for an hour, or a payload gets tampered with in transit.\n\nHere's what I actually built to handle that, using FastAPI + PostgreSQL + Redis.\n\n**The core problems I solved:**\n\n**1. Synchronous delivery blocks everything**\nThe naive approach calls the subscriber URL inline, so one slow endpoint stalls your whole ingest path. Fix: return `202 Accepted` immediately, persist the event, and deliver asynchronously.\n\n**2. Workers crash and jobs disappear**\nIf a worker dies mid-delivery, that job is stuck `IN_FLIGHT` forever. Fix: a watchdog that sweeps every 30s and requeues any job that has gone stale.\n\n**3. Retries without backoff make things worse**\nHammering a struggling subscriber after a failure makes recovery harder. Fix: exponential backoff (2s → 32s, max 5 attempts) using a Redis sorted set as a delay queue, where each job's score is its next-attempt timestamp.\n\n**4. One dead subscriber degrades the whole system**\nFix: a circuit breaker per subscription. 5 consecutive failures trip it OPEN. After a 60s cooldown, a single probe request tests recovery before normal delivery resumes.\n\n**5. No payload integrity**\nFix: a per-subscription HMAC-SHA256 signature on every payload, compared in constant time with `hmac.compare_digest` to guard against timing attacks.\n\n**Result:** 99.9% delivery reliability across 10,000+ daily webhooks, with full visibility via Prometheus + Grafana.\n\nFull deep-dive coming soon.",
"title": "Building a Reliable Webhook Delivery System: What Actually Broke and How I Fixed It"
}