Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifcguno7dqlnptd4akxrqmfrnab5sma35ubov6q2ox4eblrgdsxku",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjhbi47gike2"
  },
  "path": "/t/debugging-opus-4-6-why-claude-codes-reasoning-depth-dropped-67-and-what-to-do-about-it/175236#post_1",
  "publishedAt": "2026-04-14T09:40:24.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "EvoLink’s",
    "@om_patel5"
  ],
  "textContent": "# Debugging Opus 4.6: Why Claude Code’s Reasoning Depth Dropped 67% and What to Do About It\n\nTwo configuration changes. Zero announcements. A 67% drop in reasoning depth across 6,852 sessions.\n\nIf you’ve been building with Claude Code and noticed degraded output quality since February 2026, this post traces the exact root cause and walks through the fix.\n\n## Symptoms\n\nDevelopers started reporting these patterns in early April:\n\n  * Multi-step reasoning tasks that previously succeeded now fail or produce incomplete results\n  * The model skips reading files it should analyze\n  * Fabricated API references and non-existent function calls appear in output\n  * Simple tasks still work fine, but anything requiring depth breaks down\n\n\n\nThe pattern is consistent: shallow tasks are unaffected, deep tasks are degraded.\n\n## Quantifying the Damage\n\n### BridgeBench Data\n\nBridgeBench tests AI models on code analysis hallucination — 30 tasks, 175 questions, execution-verified ground truth.\n\nOpus 4.6 moved from #2 (83.3% accuracy, ~17% fabrication) to #10 (68.3% accuracy, 33% fabrication) in the span of weeks.\n\nThe full picture:\n\nModel | Accuracy | Fabrication Rate | Rank\n---|---|---|---\nGrok 4.20 Reasoning | 91.8% | 10.0% | #1\nGPT-5.4 | 86.1% | 16.7% | #2\nClaude Opus 4.5 | 72.3% | 27.9% | #6\n**Claude Opus 4.6** | **68.3%** | **33.0%** | **#10**\n\nTwo things stand out:\n\n  1. Opus 4.6 is less accurate than its predecessor (4.5)\n  2. Sonnet 4.6 (72.4% accuracy) — a smaller model — outperforms it\n\n\n\n### Session-Level Analysis\n\nAn AMD executive’s analysis of 6,852 sessions quantified a 67% drop in reasoning depth. Developer Om Patel’s controlled A/B test (same prompt, 4.6 vs 4.5) showed 4.6 failing 5/5 times while 4.5 passed 5/5. His tweet documenting this reached 682K views.\n\n## Root Cause Analysis\n\nTwo changes in Anthropic’s defaults compound to create the degradation:\n\n### Change 1: Effort Level Default (March 3, 2026)\n\nThe `effort` parameter controls how much reasoning the model applies. It was changed from `high` to `medium`.\n\nUnder `medium`, the model applies a cost-saving heuristic: estimate task complexity, allocate proportional effort. The failure mode is systematic underestimation — complex tasks get classified as simple and receive insufficient reasoning.\n\n### Change 2: Adaptive Thinking (February 9, 2026)\n\nA new mechanism lets the model dynamically allocate reasoning tokens per conversation turn. Under `medium` effort, this can result in **zero reasoning tokens** for certain turns.\n\nThe interaction between these changes is the core issue: medium effort + adaptive thinking = the model sometimes literally doesn’t think before responding.\n\n## Fix: Three Tiers\n\n### Tier 1: Per-Session Override\n\n\n    /effort max\n\n\nForces maximum reasoning depth for the current session. Must be re-applied each time.\n\n### Tier 2: Permanent Environment Configuration\n\n\n    # Add to .bashrc / .zshrc\n    export CLAUDE_CODE_EFFORT_LEVEL=max\n    export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1\n\n\nThis persists across sessions and prevents adaptive thinking from zeroing out reasoning tokens.\n\n### Tier 3: Model Fallback\n\nSwitch to Opus 4.5: `claude-opus-4-5-20251101`\n\nTrade-off: slower inference, higher token cost, but consistent reasoning quality.\n\n## Architecture Consideration: Model Routing\n\nThis incident highlights a practical problem for teams using LLMs in production: when a provider silently changes model behavior, you need the ability to reroute without code changes.\n\nEvoLink’s unified API gateway provides a single endpoint for 30+ models. The Smart Router (`evolink/auto`) can route reasoning-heavy tasks to models with lower hallucination rates automatically. When model quality is a moving target, routing flexibility is an architectural requirement.\n\n## Timeline\n\nDate | Event\n---|---\nFeb 9, 2026 | Adaptive thinking introduced\nMar 3, 2026 | Effort default: high → medium\nApr 10, 2026 | Om Patel’s canary test tweet (682K views)\nApr 14, 2026 | BridgeBench confirms #10 ranking, 33% fabrication\n\n## What to Watch\n\n  * Whether Anthropic reverts the effort default or refines adaptive thinking\n  * BridgeBench trajectory over coming weeks\n  * Community-developed canary tests for detecting silent model changes\n\n\n\n* * *\n\n_Sources: BridgeBench (bridgebench.ai/hallucination), @om_patel5 on X, GitHub Issue #42796, Digit.in, pasqualepillitteri.it_",
  "title": "Debugging Opus 4.6: Why Claude Code's Reasoning Depth Dropped 67% and What to Do About It"
}