Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihk5nj645leqs4fufpfk7cawzr32kmtx2erhjtsp2lta6zxe7cnke",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mope63l4kqb2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreifyrz5qooyfbpm3qfu4xevx4rn564yjf44hazslbaw2a27tnu6brq"
    },
    "mimeType": "image/webp",
    "size": 340792
  },
  "path": "/prashantsasalatti/the-deadlock-that-only-showed-up-under-load-debugging-db2-sqlcode-811-in-production-32ol",
  "publishedAt": "2026-06-20T07:08:09.000Z",
  "site": "https://dev.to",
  "tags": [
    "database",
    "java",
    "backend",
    "debugging"
  ],
  "textContent": "_A real production incident, what the error actually meant, and the fix that stopped it from happening again._\n\n\n_Photo by Chris Ried on Unsplash_\n\nSome bugs are polite. They fail the same way every time, on your machine, with a clean stack trace pointing at the exact line. You fix them before lunch.\n\nThis was not one of those bugs.\n\nThis one only appeared in production, only under concurrent load, and only sometimes. The error code, DB2 **SQLCODE -811** , is one of those messages that looks like it's telling you something useful while telling you almost nothing. Here's how we tracked it down on a payroll system processing timecards for a large workforce, and what it taught me about writing code that survives concurrency.\n\n##  The symptom\n\nThe application is a Java EE enterprise system backed by DB2. During peak processing windows, a batch job that reconciled timecard records started throwing:\n\n\n\n    SQLCODE = -811, SQLSTATE = 21000\n\n\nIntermittently. Not every run. Not reproducible on demand. The kind of failure that makes you stare at logs at 3 AM wondering if you imagined it.\n\nThe job would succeed nine times out of ten. The tenth time, it blew up, and because this was payroll, blowing up one time in ten is not something you get to shrug off.\n\n\n_Two paths writing to the same place. Photo by Jordan Harrison on Unsplash._\n\n##  What SQLCODE -811 actually means\n\nHere's the part the error message hides. SQLCODE -811 is not a deadlock. It's not a lock timeout. It's this: a subquery, embedded SELECT, or SELECT INTO returned **more than one row** where the code expected exactly one.\n\nThat's it. A scalar context got a non-scalar answer.\n\nSo why was it intermittent? Because the duplicate row only existed transiently, during a window where two processes were both writing to the same logical record before either had committed. Under light load, the timing never lined up. Under heavy concurrent load, it did, and a query that assumed uniqueness suddenly found two rows mid-flight.\n\nThis is the trap with -811: the error points at the SELECT, but the SELECT isn't the bug. The bug is upstream, in how rows got created in the first place.\n\n##  Tracking it down\n\nThree things made this findable.\n\n**1. Stop trusting that it can't return two rows.** The query had a SELECT INTO against a working table, filtered on what everyone assumed was a unique business key. It wasn't enforced as unique at the database level, it was unique by convention. Convention does not survive a race condition. I added logging to dump the offending key and row count at the moment of failure. The next time it fired, the evidence was unambiguous: two rows, identical business key, created milliseconds apart.\n\n**2. Find who writes to that table, and when.** Two code paths inserted into the same working table. Neither was aware of the other. Both ran during the same processing window. Each one, in isolation, was correct. Together, they produced transient duplicates that a downstream SELECT INTO couldn't handle. This is the thing about concurrency bugs: every individual piece of code can be correct and the system can still be wrong. The defect lives in the gaps between components, not inside any one of them.\n\n**3. Reproduce it deliberately.** I could not fix what I could not reproduce. So I forced the race: two threads hammering the same key path concurrently against a test database. Within a few runs, -811 appeared on command. Once a heisenbug becomes reproducible, it stops being scary and becomes ordinary work.\n\n##  The fix\n\nThere were three layers to it, in order of how durable each one is.\n\n**Layer 1, enforce uniqueness where it actually lives.** A unique constraint at the database level. If the data model says a key is unique, the database should be the thing that guarantees it, not the hope that no two processes collide. This converts a silent, intermittent data corruption into a loud, immediate, catchable error.\n\n**Layer 2, make the query defensive.** Where a SELECT INTO genuinely could encounter more than one row in a valid state, the query needs to say which one it wants, ordering plus a single-row fetch, rather than assuming the result set has exactly one member. Don't ask the database for the row when a row is possible.\n\n**Layer 3, fix the write path.** The real cure: the two code paths needed coordination so they couldn't both create the same logical record in overlapping transactions. Constraints and defensive reads keep you safe; fixing the write path is what makes the problem actually go away.\n\nThe constraint and the defensive read are guardrails. The write-path fix is the cure. You want all three, because guardrails catch the next bug you haven't met yet.\n\n##  What I took away from it\n\n**The error code points at the crime scene, not the criminal.** -811 fired in the SELECT, but the SELECT was the victim. The actual bug was an unenforced uniqueness assumption upstream. Read past the line number.\n\n**Unique by convention is a bug with a delay on it.** Any invariant your code depends on but the database doesn't enforce is a race condition waiting for enough traffic. If it matters, constrain it.\n\n**Intermittent means timing, and timing means concurrency.** When a bug refuses to reproduce, stop trying to find the broken line and start looking for two things touching the same state at the same time. The intermittency is the clue, not the obstacle.\n\n**A heisenbug you can reproduce on demand is already half fixed.** Most of the work in a concurrency bug is making it deterministic. Once you can summon it, the fix is usually small.\n\nThe satisfying part wasn't the fix. It was that after it shipped, the 3 AM pages for that job stopped completely. The most boring outcome possible, nothing happening, is exactly what you're working toward in production support.\n\nIf you've debugged DB2 under concurrent load, you've probably met -811's cousins: -911 (deadlock/timeout rollback), -803 (duplicate key on a constraint that was enforced), -904 (resource unavailable). They all share the same lesson: the database is telling you the truth about your assumptions, usually at the worst possible hour.\n\n_What's the most stubborn intermittent bug you've shipped a fix for? I'd like to hear how you cornered it._",
  "title": "The Deadlock That Only Showed Up Under Load: Debugging DB2 SQLCODE -811 in Production"
}