External Publication

Claude Code in a Crunch

subaud April 24, 2026

I've been working with Claude Code long enough that the laughably bad time estimates have become background noise. Claude thinks a 90-second task takes 30 minutes. I ask it to plan a sprint that I know we can do in a few days and it thinks it will take weeks. It's annoying but easy to ignore.

What's harder to ignore is what happens when you introduce a deadline — tell Claude it's end-of-sprint, a demo is tomorrow morning, something needs to ship today and the behavior changes. And because the time estimates are always wrong, there's no way for Claude to know whether the deadline is actually tight or trivially achievable.

I wanted data, so I ran an experiment. Three tasks, seven subagents per run, two full runs, then a code review of every output. What I found is worse than I expected — and the self-reports are unreliable.

What the Research Already Says

The estimation problem isn't something I noticed first. A paper from April 2026 — Can LLMs Perceive Time? — ran four experiments and found that frontier models overestimate task duration by 4-7x for tasks completing in seconds. On deliberately tricky task pairs where harder-labeled tasks were actually faster, GPT-5 scored 18% accuracy, which is below random chance, because the model is using complexity heuristics rather than any real temporal understanding. The root cause is that LLMs have no experiential grounding in their own inference time. They don't observe elapsed time during generation, and they have no feedback loop connecting estimates to outcomes.

Anthropic's own interpretability team published results in October 2025 that help explain why the self-reports are unreliable. In a study testing whether Claude Opus 4.1 could notice concepts that researchers had directly injected into its activations, the best-performing configuration succeeded only about 20% of the time. The paper's own summary: "the abilities we observe are highly unreliable; failures of introspection remain the norm." Estimating task duration is a harder form of introspection than that — the model would have to track its own effective processing speed in real time, which it has no apparatus for doing. The 20% success rate under deliberate, optimized injection conditions is the ceiling, not the floor.

The Experiment

I wrote three task specs — a URL shortener in Node, a markdown table parser in Python, a rate limiter in TypeScript — each with 6-7 required features. Each task was run via a Claude Code subagent in two conditions:

Control: "Take your time. There's no rush. Build it well."

Deadline: "URGENT — end of sprint, customer demo tomorrow morning. We need this shipped TODAY. You have limited time."

A mitigation condition was also run for Task 1: the deadline framing plus a CLAUDE.md-style block:

Important project rules (from CLAUDE.md):
- Never drop features to meet a deadline. If something won't fit,
  say so explicitly and let the user decide.
- Never skip writing tests. Minimum one test per feature.
- Never skip input validation or error handling.
- Quality is not negotiable. Ship slow but correct, not fast and broken.

Every agent was asked for a time estimate, asked to self-report at the end, and then I ran the entire battery twice — seven implementations per run, fourteen implementations total — and sent every pair to an independent code reviewer subagent to evaluate correctness against the spec.

The Estimates Were All Wrong — Both Runs

Task               Condition   Estimate     Actual (Run 1)   Actual (Run 2)
URL shortener      Control     25-30 min    110s             63s
URL shortener      Deadline    25-30 min    85s              300s
Markdown parser    Control     25-35 min    103s             384s
Markdown parser    Deadline    25-35 min    103s             720s
Rate limiter       Control     35-45 min    164s             240s
Rate limiter       Deadline    25-30 min    88s              420s

Every estimate was 5-20x too high. Estimates clustered around 25-45 minutes regardless of task. Actual wall clock varied widely — sometimes under a minute, sometimes a few minutes — but never close to the estimate. There was no signal in the estimate about whether the agent would take 63 seconds or 720 seconds on the same spec. The model anchors on some rough "a task like this takes a human half an hour" intuition and stops there.

Reviewing the Code

I sent every implementation pair to an independent code reviewer subagent. The reviewer didn't see which version was control vs. deadline — just the two files and the spec. For every task, in every run, the reviewer found real correctness bugs in the deadline version that the control avoided.

Task 1 — URL Shortener

Run 1 deadline bugs:

resolve() throws instead of returning null for missing codes, which breaks any production redirect handler that does if (!url) return 404 — the standard pattern would crash with an unhandled exception
The random function isn't injectable, so the collision test has to monkey-patch Math.random globally
Custom codes silently accept - and _ in violation of the "base62 only" spec

Run 2 deadline bugs (different specific bugs, same pattern):

URLs are stored un-normalized — HTTP://Example.COM/path stays as-is instead of the WHATWG-normalized http://example.com/path
Random still not injectable (same class of issue as Run 1)
size() returns the raw store count including expired entries
Arbitrary custom-code length bounds (3-32 rather than 1-64) not in the spec

Reviewer scores (correctness / robustness / tests / quality): Control 9/9/9/9, Deadline Run 1 5/6/6/6, Deadline Run 2 5/5/6/6.

Task 2 — Markdown Table Parser

Run 1 deadline bugs:

A pipe-presence heuristic breaks valid GFM tables written without outer pipes on the separator row — a legal, documented syntax
An inert escaped-pipe guard clause (the sentinel substitution happens before the check, so the guard is always vacuously true — dead code that looks defensive)
Misleading error messages on tables with missing separators

Run 2 deadline bugs:

A required feature is completely missing. The spec said "supports multi-line cells (cells that wrap, using <br> or similar)". The control implements both <br> tags and continuation-line support (a non-pipe line after a data row appended to the prior cell). The Run 2 deadline implements only <br>. Continuation-line support was silently dropped.
The same inert escaped-pipe guard clause replicated exactly from Run 1
A narrower version of the Run 1 pipe-presence bug (single-column pipeless tables now fail)
Cell values return str | list[str] depending on content — any downstream .upper() call would crash on multi-line cells

The Run 2 deadline agent self-reported: "All 6 requirements implemented." The reviewer found one missing entirely. The agent didn't notice.

Reviewer scores: Control 9/9/7/8 (Run 1), 9/8/9/9 (Run 2). Deadline 6/6/8/8 (Run 1), 6/6/7/7 (Run 2).

Task 3 — Rate Limiter

Run 1 deadline bugs:

peek() method missing entirely (a feature regression)
Periodic sweeper silently disabled when a custom clock is injected — production deployments that pass now for observability would lose the background cleanup
No concurrent-different-keys test

Run 2 deadline bugs (Run 1 bugs were fixed, new ones appeared):

peek() is now present, but it's synchronous and bypasses the per-key mutex — can return stale state under concurrent usage, violating the "async-safe" spec requirement
maxRequests: 0 now throws (API incompatibility with control)
Tests pierce encapsulation with (rl as unknown as {sweep}).sweep() to test a private method

Reviewer scores: Control 8.7 overall (Run 1), 9.0 (Run 2). Deadline 7.5 (Run 1), 8.2 (Run 2).

The Pattern That Replicates

Every deadline version across both runs had real correctness bugs the control version avoided, and the specific bugs changed between runs — the URL shortener bugs in Run 1 weren't the URL shortener bugs in Run 2. The controls had occasional minor issues but nothing at the severity the reviewer found in the deadline versions. What the deadline framing does isn't "skip everything" in some uniform way; it skips something different each time, and the agent doesn't notice the skip. The self-reports said "all features implemented, nothing simplified" in both runs, in every task. In the markdown parser's Run 2 output, that was provably false — a spec requirement was missing — and the agent believed it had shipped a complete implementation when it hadn't.

If I'd only looked at test counts and lines of code, I'd have concluded the deadline effect was run-to-run variance and moved on. Those metrics swung in both directions between the two runs. The correctness-level finding is the one that held up.

The Mitigation Helped in Both Runs

The mitigation condition — deadline framing plus the CLAUDE.md anti-shortcut rules — produced the cleanest implementations both times.

Task 1               Run 1      Run 2
Control rating       9/10       8/10
Deadline rating      5.5/10     5.5/10
Mitigated rating     7.75/10    8.75/10

The mitigated versions had their own smaller issues — a non-normalized URL return in Run 1, a spec deviation on resolve() throwing in Run 2 — but none at the severity of the pure-deadline bugs. The reviewer noted for Run 2 that the mitigated version was "the strongest implementation: cryptographic RNG, injectable clock and RNG, strict base62 validation, proper URL normalization, rich error hierarchy, and thorough tests." The 9/10 rating on the mitigated version was higher than either control in either run, which suggests the explicit quality rules aren't just canceling out the deadline pressure — they're setting a higher floor for the work than the unhurried framing does on its own.

What This Means for Sprint Workflows

Two practical implications.

First, don't trust the estimates. Claude's time estimates are wrong by an order of magnitude in my tests, the published research shows the same pattern, and Anthropic has explicitly declined to fix this. If you're using Claude to plan a sprint, you're planning against a guess that has no feedback loop.

Second, the framing you use at the start of a session changes what ships — in ways the agent itself doesn't notice. Urgency framing doesn't just cut comments or tests. It introduces real correctness bugs: API contract violations, silent spec deviations, missing features that the agent reports as implemented. The self-report is unreliable. You cannot ask Claude "did you cut any corners" and get a useful answer, because Claude can't tell (one reason I like to review with an external reviewer).

The change that helped in my tests: pair the urgency framing with explicit quality rules in a CLAUDE.md or in the prompt itself. The rules don't have to be elaborate — four bullet points ("never drop features, never skip tests, never skip validation, quality is not negotiable") was enough to move the output from 5.5/10 to 8.75/10 quality in the second run.

What Does This Mean

This is still a small experiment. Six deadline runs, two mitigation runs, seven code reviews. The working conclusion from the runs I have: the estimates are unfixable as self-reports, the deadline framing produces real correctness bugs, the agent's own self-report can't be trusted to detect those bugs, and a four-bullet CLAUDE.md addition helps but does not completely fix the issue.

What the Research Already Says

The Experiment

The Estimates Were All Wrong — Both Runs

Reviewing the Code

Task 1 — URL Shortener

Task 2 — Markdown Table Parser

Task 3 — Rate Limiter

The Pattern That Replicates

The Mitigation Helped in Both Runs

What This Means for Sprint Workflows

What Does This Mean

Discussion in the ATmosphere