Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibs4pmnc7c6tv3igtfzzasf22qv2brdzt5qo7mbakk6ho74wwkfda",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mlpsecmfqdx2"
  },
  "path": "/t/wrong-intent-inference-is-measurable-benchmark-strict-precision-mode-proposal/1380779#post_1",
  "publishedAt": "2026-05-13T06:47:39.000Z",
  "site": "https://community.openai.com",
  "textContent": "I built a small benchmark for a failure mode I call wrong intent inference: cases where an assistant gives a reasonable answer, but to the wrong implied question.\n\nThe original motivation was a proposed Strict / Precision Mode for ChatGPT: a behavior mode that avoids guessing the user’s intent when a message is short, quoted, corrective, or context-dependent.\n\nDuring testing, the deeper point became measurable:\n\n> The issue is not only whether the model can reason.\n>  Sometimes the model starts reasoning from the wrong interpretation of the user’s message.\n\n## Benchmark idea\n\nThe benchmark tests whether an assistant chooses the right action:\n\n  * answer directly\n\n  * ask a short clarification\n\n  * acknowledge a correction\n\n  * continue a pending task\n\n  * avoid executing an implied request that was not actually asked\n\n\n\n\nThe current benchmark includes RU and EN conversational cases across four categories:\n\n  1. quoted_reply\nUser quotes a phrase from the assistant’s previous answer. The model should not automatically rewrite, explain, calculate, or execute unless that intent is clear.\n\n  2. short_fragment\nUser sends a short fragment such as “blue”, “throughput”, “SSL”, “more expensive”, “2”, “this”. The model should avoid defaulting to the most common interpretation when the intended operation is unclear.\n\n  3. acknowledgment_or_correction\nUser says things like “yes, exactly”, “not 2024, 2025”, “not this”. The model should recognize agreement/correction instead of starting a new task.\n\n  4. clear_direct / pending task\nUser clearly selects from an existing pending task: “PowerShell”, “JSON”, “short version”, “in English”, “first option”. The model should execute directly instead of asking again.\n\n\n\n\n## Tested intervention\n\nI tested a Strict / Precision v8 behavior prompt against a no-prompt baseline.\n\nThe point is not that this prompt is the final product implementation. It is only a proof-of-concept showing that this failure mode is measurable and partially reducible.\n\n## Main results\n\n### RU holdout v2, 40 cases\n\nMethod | Pass rate | Wrong intent inference | Unnecessary clarification\n---|---|---|---\nNo prompt baseline | 35.0% | 42.5% | 15.0%\nStrict / Precision v8 | 77.5% | 10.0% | 12.5%\n\n### EN holdout v2, 40 cases\n\nMethod | Pass rate | Wrong intent inference | Unnecessary clarification\n---|---|---|---\nNo prompt baseline | 47.5% | 32.5% | 10.0%\nStrict / Precision v8 | 72.5% | 22.5% | 10.0%\n\n## Interpretation\n\nStrict / Precision behavior substantially reduced wrong intent inference, especially for quoted replies and correction/acknowledgment turns.\n\nThe strongest result is on the Russian split, where wrong intent inference dropped from 42.5% to 10.0% on holdout v2.\n\nThe English split also improved, though less strongly. This is expected because some ambiguity patterns are language-specific, and the Strict prompt was originally developed from Russian conversational failures.\n\n## Remaining weakness\n\nThe weakest category is still short_fragment.\n\nModels still sometimes see a single word like “blue”, “throughput”, “SSL”, or “Wednesday” and answer the most likely implied question instead of asking what operation the user wants.\n\nThis suggests that wrong intent inference is not solved by simply telling the model “be careful”. The model needs better handling of underdetermined short replies.\n\n## Proposal\n\nI think ChatGPT could benefit from an optional Strict / Precision Mode, focused on:\n\n  * minimizing wrong intent inference;\n\n  * asking short clarifying questions only when intent is underdetermined;\n\n  * not over-clarifying when the user clearly selects from a pending task;\n\n  * treating quoted replies and corrections carefully;\n\n  * keeping responses concise unless detail is requested.\n\n\n\n\nThis could be useful for users who value precision over conversational guessing, especially in technical, legal, academic, coding, and high-stakes workflows.\n\n## Repository\n\nRepository name: strict-intent-bench\nGitHub username: pneqnp-iswr\nI can share the link once my forum account is allowed to post links.\n\nThe repository includes:\n\n  * benchmark data\n\n  * baseline and Strict / Precision prompts\n\n  * evaluation runner\n\n  * raw case results\n\n  * summary reports\n\n  * charts\n\n  * limitations and error analysis\n\n\n\n\n## Limitations\n\nThis is a small benchmark, not a final academic evaluation.\n\nLimitations:\n\n  * small dataset size;\n\n  * RU/EN only;\n\n  * LLM-based grading can be noisy;\n\n  * prompt-level intervention only;\n\n  * possible dataset/prompt overfitting;\n\n  * no broad model comparison yet.\n\n\n\n\nStill, the result suggests that wrong intent inference is a measurable reliability problem, and that a Strict / Precision behavior layer can reduce it.",
  "title": "Wrong intent inference is measurable: benchmark + Strict / Precision Mode proposal"
}