Wrong intent inference is measurable: benchmark + Strict / Precision Mode proposal
I built a small benchmark for a failure mode I call wrong intent inference: cases where an assistant gives a reasonable answer, but to the wrong implied question.
The original motivation was a proposed Strict / Precision Mode for ChatGPT: a behavior mode that avoids guessing the user’s intent when a message is short, quoted, corrective, or context-dependent.
During testing, the deeper point became measurable:
The issue is not only whether the model can reason. Sometimes the model starts reasoning from the wrong interpretation of the user’s message.
Benchmark idea
The benchmark tests whether an assistant chooses the right action:
answer directly
ask a short clarification
acknowledge a correction
continue a pending task
avoid executing an implied request that was not actually asked
The current benchmark includes RU and EN conversational cases across four categories:
quoted_reply: The user quotes a phrase from the assistant’s previous answer. The model should not automatically rewrite, explain, calculate, or execute unless that intent is clear.
short_fragment: The user sends a short fragment such as “blue”, “throughput”, “SSL”, “more expensive”, “2”, “this”. The model should avoid defaulting to the most common interpretation when the intended operation is unclear.
acknowledgment_or_correction: The user says things like “yes, exactly”, “not 2024, 2025”, “not this”. The model should recognize agreement or correction instead of starting a new task.
clear_direct / pending task: The user clearly selects from an existing pending task: “PowerShell”, “JSON”, “short version”, “in English”, “first option”. The model should execute directly instead of asking again.
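To make the actions and categories above concrete, here is a minimal sketch of how a single case could be represented. It is not the repository's schema: the class names, field names, case IDs, and context strings are hypothetical, and only the user messages and category names come from this post.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Expected assistant actions, named after the list in this post."""
    ANSWER_DIRECTLY = "answer_directly"
    ASK_CLARIFICATION = "ask_clarification"
    ACKNOWLEDGE_CORRECTION = "acknowledge_correction"
    CONTINUE_PENDING_TASK = "continue_pending_task"
    AVOID_IMPLIED_EXECUTION = "avoid_implied_execution"


class Category(Enum):
    """The four case categories described in this post."""
    QUOTED_REPLY = "quoted_reply"
    SHORT_FRAGMENT = "short_fragment"
    ACKNOWLEDGMENT_OR_CORRECTION = "acknowledgment_or_correction"
    CLEAR_DIRECT = "clear_direct"


@dataclass(frozen=True)
class BenchCase:
    case_id: str        # hypothetical ID format
    language: str       # "ru" or "en"
    category: Category
    context: str        # illustrative summary of the previous assistant turn
    user_message: str   # the short, quoted, or corrective user turn
    expected: Action    # the action a grader would accept


# Illustrative cases only; the contexts are invented for this sketch.
EXAMPLE_CASES = [
    BenchCase("en-sf-001", "en", Category.SHORT_FRAGMENT,
              "Assistant listed several paint colors.", "blue",
              Action.ASK_CLARIFICATION),
    BenchCase("en-ac-001", "en", Category.ACKNOWLEDGMENT_OR_CORRECTION,
              "Assistant referred to the 2024 report.", "not 2024, 2025",
              Action.ACKNOWLEDGE_CORRECTION),
    BenchCase("en-cd-001", "en", Category.CLEAR_DIRECT,
              "Assistant asked: PowerShell or Bash?", "PowerShell",
              Action.CONTINUE_PENDING_TASK),
]
```

Storing the expected action as an explicit label keeps grading a comparison of actions rather than a judgment about answer quality.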
Tested intervention
I tested a Strict / Precision v8 behavior prompt against a no-prompt baseline.
The point is not that this prompt is the final product implementation. It is only a proof-of-concept showing that this failure mode is measurable and partially reducible.
Main results
RU holdout v2, 40 cases
| Method | Pass rate | Wrong intent inference | Unnecessary clarification |
|---|---|---|---|
| No prompt baseline | 35.0% | 42.5% | 15.0% |
| Strict / Precision v8 | 77.5% | 10.0% | 12.5% |
EN holdout v2, 40 cases
| Method | Pass rate | Wrong intent inference | Unnecessary clarification |
|---|---|---|---|
| No prompt baseline | 47.5% | 32.5% | 10.0% |
| Strict / Precision v8 | 72.5% | 22.5% | 10.0% |
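The three columns can be read as per-case flags averaged over the split. The rows do not always sum to 100% (the RU baseline row sums to 92.5%, the EN Strict row to 105%), so the sketch below assumes the flags are independent rather than a strict partition. The `CaseResult` fields and the `summarize` function are hypothetical names, and the records are synthetic, built only to reproduce the RU baseline row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    passed: bool
    wrong_intent: bool                # answered a question the user did not ask
    unnecessary_clarification: bool   # asked although intent was already clear


def summarize(results):
    """Return each flag as a percentage of all cases, rounded to one decimal."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")

    def pct(count):
        return round(100.0 * count / n, 1)

    return {
        "pass_rate": pct(sum(r.passed for r in results)),
        "wrong_intent_rate": pct(sum(r.wrong_intent for r in results)),
        "unnecessary_clarification_rate": pct(
            sum(r.unnecessary_clarification for r in results)
        ),
    }


# Synthetic 40-case split: 14 passes, 17 wrong-intent failures,
# 6 unnecessary clarifications, and 3 failures with neither flag set.
# The last group is an assumption about how the remaining 7.5% is made up.
synthetic_results = (
    [CaseResult(True, False, False)] * 14
    + [CaseResult(False, True, False)] * 17
    + [CaseResult(False, False, True)] * 6
    + [CaseResult(False, False, False)] * 3
)

print(summarize(synthetic_results))
```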
Interpretation
Strict / Precision behavior substantially reduced wrong intent inference, especially for quoted replies and correction/acknowledgment turns.
The strongest result is on the Russian split, where wrong intent inference dropped from 42.5% to 10.0% on holdout v2.
The English split also improved, though less strongly: wrong intent inference dropped from 32.5% to 22.5%, and unnecessary clarification stayed at 10.0%. This is expected because some ambiguity patterns are language-specific, and the Strict prompt was originally developed from Russian conversational failures.
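With only 40 cases per split, each percentage corresponds to a small number of cases, so it helps to see the counts behind the rates. The short calculation below re-derives case counts and reductions from the table values. The function name `reduction` and the dictionary keys are illustrative.

```python
N_CASES = 40  # cases per holdout split, as stated in the results

# Wrong intent inference rates (%) taken from the two results tables.
RATES = {
    "ru": {"baseline": 42.5, "strict": 10.0},
    "en": {"baseline": 32.5, "strict": 22.5},
}


def reduction(baseline_pct, strict_pct, n=N_CASES):
    """Convert two rates into case counts plus absolute and relative drops."""
    return {
        "baseline_cases": round(baseline_pct * n / 100),
        "strict_cases": round(strict_pct * n / 100),
        "absolute_drop_points": round(baseline_pct - strict_pct, 1),
        "relative_drop_pct": round(
            100 * (baseline_pct - strict_pct) / baseline_pct, 1
        ),
    }


for split, rates in RATES.items():
    print(split, reduction(rates["baseline"], rates["strict"]))
```

On the RU split this works out to 17 cases falling to 4 (a 32.5-point drop, about 76.5% relative). On the EN split it is 13 cases falling to 9 (a 10-point drop, about 30.8% relative).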
Remaining weakness
The weakest category is still short_fragment.
Models still sometimes see a single word like “blue”, “throughput”, “SSL”, or “Wednesday” and answer the most likely implied question instead of asking what operation the user wants.
This suggests that wrong intent inference is not solved by simply telling the model “be careful”. The model needs better handling of underdetermined short replies.
Proposal
I think ChatGPT could benefit from an optional Strict / Precision Mode, focused on:
minimizing wrong intent inference;
asking short clarifying questions only when intent is underdetermined;
not over-clarifying when the user clearly selects from a pending task;
treating quoted replies and corrections carefully;
keeping responses concise unless detail is requested.
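As a rough illustration of the decision order these points imply, here is a toy rule-based sketch. It is not the Strict / Precision v8 prompt and not how a production mode would work: the cue-word list, the two-word threshold, and all names are hypothetical, and a real system would need the model itself to judge intent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    ANSWER_DIRECTLY = "answer_directly"
    ASK_CLARIFICATION = "ask_clarification"
    ACKNOWLEDGE_CORRECTION = "acknowledge_correction"
    CONTINUE_PENDING_TASK = "continue_pending_task"


@dataclass
class Turn:
    message: str
    pending_options: Sequence[str] = ()  # choices the assistant offered earlier
    quotes_assistant: bool = False       # user quoted the previous answer


# Hypothetical cue words; a keyword list is only a stand-in for real judgment.
ACK_OR_CORRECTION_CUES = {"yes", "exactly", "right", "no", "not"}
SHORT_FRAGMENT_MAX_WORDS = 2  # assumed threshold for "short fragment"


def choose_action(turn: Turn) -> Action:
    text = turn.message.strip().lower()
    words = text.split()

    # 1. A reply matching a pending option is a clear selection: execute it.
    if any(text == option.lower() for option in turn.pending_options):
        return Action.CONTINUE_PENDING_TASK

    # 2. Agreement or correction should be acknowledged, not treated as a task.
    if words and words[0].strip(",.!?") in ACK_OR_CORRECTION_CUES:
        return Action.ACKNOWLEDGE_CORRECTION

    # 3. Quoted or very short messages without a pending task are
    #    underdetermined, so ask a short clarifying question.
    if turn.quotes_assistant or len(words) <= SHORT_FRAGMENT_MAX_WORDS:
        return Action.ASK_CLARIFICATION

    # 4. Otherwise the request is treated as explicit.
    return Action.ANSWER_DIRECTLY


print(choose_action(Turn("PowerShell", pending_options=("PowerShell", "Bash"))))
print(choose_action(Turn("not 2024, 2025")))
print(choose_action(Turn("blue")))
```

Checking pending options first is what prevents over-clarifying: “PowerShell” is a short fragment, but with a pending choice it is a clear selection.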
This could be useful for users who value precision over conversational guessing, especially in technical, legal, academic, coding, and high-stakes workflows.
Repository
Repository name: strict-intent-bench
GitHub username: pneqnp-iswr
I can share the link once my forum account is allowed to post links.
The repository includes:
benchmark data
baseline and Strict / Precision prompts
evaluation runner
raw case results
summary reports
charts
limitations and error analysis
Limitations
This is a small benchmark, not a final academic evaluation. Known limitations:
small dataset size;
RU/EN only;
LLM-based grading can be noisy;
prompt-level intervention only;
possible dataset/prompt overfitting;
no broad model comparison yet.
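To put the small dataset size in perspective, one option is to attach a confidence interval to each rate. The sketch below uses the Wilson score interval, a standard choice for proportions at small sample sizes. It is not part of the repository's evaluation, and the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# RU holdout v2 pass rates expressed as counts out of 40:
# 35.0% is 14 cases (baseline), 77.5% is 31 cases (Strict / Precision v8).
for label, passed in [("baseline", 14), ("strict", 31)]:
    low, high = wilson_interval(passed, 40)
    print(f"{label}: {low:.3f} to {high:.3f}")
```

For the RU pass rates this gives roughly 22% to 50% for the baseline and 62% to 88% for Strict / Precision v8. The intervals are wide, which is what 40 cases can support, but they do not overlap.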
Still, the result suggests that wrong intent inference is a measurable reliability problem, and that a Strict / Precision behavior layer can reduce it.