Wrong intent inference benchmark v0.8: full English 80-case result
Follow-up to my earlier post about wrong intent inference.
I published a v0.8 update for strict-intent-bench, a benchmark-first repo focused on cases where an assistant gives a plausible answer, but to the wrong implied question.
The new result is a full 80-case run on English v0.3.
| Metric | Baseline | Strict / Precision v13 |
|---|---|---|
| Action accuracy | 38.8% | 66.2% |
| Wrong intent inference | 37.5% | 5.0% |
| Metadata unnecessary clarification | 15.0% | 11.2% |
| Missing needed clarification | 30.0% | 15.0% |
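To make the percentages above easier to read as case counts, here is a minimal sketch that converts each one back to a number of cases. It assumes every metric is a fraction of the same 80 cases, which the post does not state explicitly; the names `TOTAL_CASES`, `REPORTED`, and `implied_counts` are made up for this example and are not taken from the repo.

```python
# Assumption: every metric is computed over the same 80 cases.
TOTAL_CASES = 80

# Percentages copied from the table above: (Baseline, Strict / Precision v13).
REPORTED = {
    "Action accuracy": (38.8, 66.2),
    "Wrong intent inference": (37.5, 5.0),
    "Metadata unnecessary clarification": (15.0, 11.2),
    "Missing needed clarification": (30.0, 15.0),
}


def implied_count(percent, total=TOTAL_CASES):
    """Nearest whole number of cases that matches a reported percentage."""
    return round(percent / 100 * total)


def implied_counts():
    """Map each metric to its implied (baseline, strict) case counts."""
    return {
        name: (implied_count(base), implied_count(strict))
        for name, (base, strict) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, (base, strict) in implied_counts().items():
        change = strict - base
        print(f"{name}: {base}/{TOTAL_CASES} -> {strict}/{TOTAL_CASES} ({change:+d} cases)")
```

Under that assumption, wrong intent inference drops from 30 cases to 4, while missing needed clarification still affects 12 cases, which is where the clarification trade-off shows up.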
The current best intervention is Strict / Precision v13.
The main result is not “strict prompting solves this.” It does not. The stronger claim is narrower:
A stricter action-selection policy can substantially reduce wrong intent inference while still exposing a real clarification trade-off.
I also tested later prompt variants. Some of them fixed targeted failure cases but regressed on broader checks, so v13 remains the current public champion.
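The "fixes targeted cases but regresses elsewhere" decision can be sketched as a simple promotion gate. This is a hypothetical illustration, not the repo's actual decision logic: the metric keys, the `tolerance` parameter, and the candidate numbers are all invented for the example. Only the v13 percentages come from the table above.

```python
# Hypothetical promotion gate; the repo's real decision process may differ.
HIGHER_IS_BETTER = {"action_accuracy"}
LOWER_IS_BETTER = {
    "wrong_intent",
    "unnecessary_clarification",
    "missing_clarification",
}


def regressions(champion, candidate, tolerance=0.0):
    """Return the metrics on which the candidate is worse than the champion."""
    worse = []
    for metric in sorted(HIGHER_IS_BETTER | LOWER_IS_BETTER):
        delta = candidate[metric] - champion[metric]
        if metric in HIGHER_IS_BETTER:
            delta = -delta  # flip the sign so a positive delta always means "worse"
        if delta > tolerance:
            worse.append(metric)
    return worse


def should_promote(champion, candidate, tolerance=0.0):
    """Promote only if the candidate regresses on no tracked metric."""
    return not regressions(champion, candidate, tolerance)


# v13 percentages from the table above.
V13 = {
    "action_accuracy": 66.2,
    "wrong_intent": 5.0,
    "unnecessary_clarification": 11.2,
    "missing_clarification": 15.0,
}

# Invented numbers for illustration only: fewer missed clarifications,
# but lower overall action accuracy.
EXAMPLE_CANDIDATE = {
    "action_accuracy": 62.5,
    "wrong_intent": 5.0,
    "unnecessary_clarification": 11.2,
    "missing_clarification": 12.5,
}

if __name__ == "__main__":
    print(regressions(V13, EXAMPLE_CANDIDATE))
    print(should_promote(V13, EXAMPLE_CANDIDATE))
```

A gate like this is deliberately conservative: a variant that wins on one failure type but loses on any broad metric stays a candidate rather than replacing the champion.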
Main remaining weak spots:
- short fragments;
- ask-clarification vs topic-selection boundaries;
- continue-pending-task cases;
- automated grader noise.
The repo includes the benchmark data, prompt variants, evaluation runner, full reports, and the v0.8 decision summary.
Repo: pneqnp-iswr/strict-intent-bench on github.com. Benchmark and static demo for wrong intent inference in AI assistants, with Strict / Precision behavior as a tested intervention.
Live demo: pneqnp-iswr.github.io