{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihfpxvlrmys5lynuzqtwdrtyn63gymmcdugndoev6mzocc2ms6gpy",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moqfq5wawmg2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreicb5ix5vnpezxzgyknxlef4bsk7ncsfodezy5lha3cdvggumpjrhi"
},
"mimeType": "image/webp",
"size": 103396
},
"path": "/uthman_dev/i-built-a-distributed-job-queue-in-go-to-understand-how-they-actually-work-fhk",
"publishedAt": "2026-06-20T17:29:19.000Z",
"site": "https://dev.to",
"tags": [
"architecture",
"distributedsystems",
"go",
"showdev"
],
"textContent": "I have used job queues my whole developer life without knowing what was inside them.\n\nSo I built one.\n\nNot a wrapper around an existing queue. A full implementation from scratch\nwith Redis, PostgreSQL, goroutines, and real failure handling.\n\nHere is everything I learned.\n\n## Why Dual Storage\n\nMost job queues use one store. Redis is fast. PostgreSQL is durable. I wanted both.\n\nRedis handles dispatch via a sorted set priority queue. Fast enqueue, fast dequeue.\n\nPostgreSQL is the source of truth. Every job lives there permanently.\n\nThe rule: no critical state lives only in Redis. If Redis wipes completely,\nno job is lost. PostgreSQL has everything.\n\n## Three Things Running Concurrently\n\n * A **worker pool** that executes jobs\n * A **scheduler** that promotes jobs from PostgreSQL into Redis when their time arrives\n * A **stale reaper** that detects crashed workers and requeues their jobs automatically\n\n\n\nAll three run as goroutines. All three coordinate without stepping on each other.\n\n## What Happens When a Worker Crashes\n\nThis is the part most tutorials skip.\n\nWhen a worker picks up a job it marks it as in-progress. If that worker crashes\nmid-execution the job stays marked in-progress forever unless something intervenes.\n\nThe stale reaper scans for jobs that have been in-progress longer than their timeout.\nIt requeues them automatically with exponential backoff.\n\nNo manual intervention. 
No lost jobs.\n\n## The Numbers\n\nMetric | Result\n---|---\nJob registration | 52 ns/op, 0 allocations\nJob execution | 950 ns/op\n\nBenchmarked with Go's built-in benchmark tooling.\n\nThe project ships with a Prometheus metrics endpoint and a pre-built Grafana dashboard\ncovering queue depth, throughput, and failure rates by job type.\n\n## What I Actually Understand Now\n\n * Why Redis alone is not enough for a job queue\n * Why crashed-worker recovery needs to be a first-class feature, not an afterthought\n * Why exponential backoff matters more than immediate retries\n\nThe project is open source with one external contributor already.\n\n * GitHub: github.com/codetesla51/kyu\n * Landing page: kyu-job-queue.vercel.app\n",
"title": "I built a distributed job queue in Go to understand how they actually work"
}