Wrong intent inference benchmark v0.8: full English 80-case result

OpenAI Developer Community May 15, 2026

Follow-up to my earlier post about wrong intent inference.

I published a v0.8 update for strict-intent-bench, a benchmark-first repo focused on cases where an assistant gives a plausible answer, but to the wrong implied question.

The new result is an English v0.3 full 80-case run.

| Metric | Baseline | Strict / Precision v13 |
| --- | --- | --- |
| Action accuracy | 38.8% | 66.2% |
| Wrong intent inference | 37.5% | 5.0% |
| Metadata unnecessary clarification | 15.0% | 11.2% |
| Missing needed clarification | 30.0% | 15.0% |
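
To make the size of each change easier to read, here is a minimal sketch that computes the difference between the two columns in percentage points. The rates are copied from the reported figures above; the names `REPORTED` and `point_deltas` are my own for illustration and are not taken from the repo's evaluation runner.

```python
# Reported rates in percent, copied from the figures above:
# metric -> (baseline, Strict / Precision v13).
# REPORTED and point_deltas are illustrative names, not part of the repo.
REPORTED = {
    "Action accuracy": (38.8, 66.2),
    "Wrong intent inference": (37.5, 5.0),
    "Metadata unnecessary clarification": (15.0, 11.2),
    "Missing needed clarification": (30.0, 15.0),
}


def point_deltas(reported):
    """Return v13 minus baseline, in percentage points, for each metric."""
    return {
        name: round(v13 - base, 1)
        for name, (base, v13) in reported.items()
    }


if __name__ == "__main__":
    for name, delta in point_deltas(REPORTED).items():
        # Explicit sign so gains and drops are easy to tell apart.
        print(f"{name}: {delta:+.1f} points")
```

Action accuracy is the only row where higher is better; the other three are error rates, so a negative delta there is an improvement.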

The current best intervention is Strict / Precision v13.

The main result is not “strict prompting solves this.” It does not. The stronger claim is narrower:

A stricter action-selection policy can substantially reduce wrong intent inference while still exposing a real clarification trade-off.

I also tested later prompt variants. Some improved targeted failure cases, but regressed on broader checks, so v13 remains the current public champion.
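
As a rough sketch of that decision rule, assuming a variant replaces the champion only when it improves on the targeted cases without regressing on the broader checks: everything here (`Scores`, `should_promote`, the `tolerance` parameter, and the example numbers) is hypothetical and is not taken from the repo or from the actual results.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    # Hypothetical summary of one prompt variant; both fields are fractions in [0, 1].
    targeted_pass_rate: float  # pass rate on the targeted failure cases
    broad_pass_rate: float     # pass rate on the broader checks


def should_promote(champion: Scores, challenger: Scores, tolerance: float = 0.0) -> bool:
    """Promote only if the challenger gains on targeted cases and holds on broad checks."""
    improves_targeted = challenger.targeted_pass_rate > champion.targeted_pass_rate
    holds_broad = challenger.broad_pass_rate >= champion.broad_pass_rate - tolerance
    return improves_targeted and holds_broad


if __name__ == "__main__":
    # Invented numbers for illustration only, not benchmark results.
    champion = Scores(targeted_pass_rate=0.50, broad_pass_rate=0.66)
    challenger = Scores(targeted_pass_rate=0.70, broad_pass_rate=0.60)
    print(should_promote(champion, challenger))
```

With these made-up numbers the challenger fixes more targeted cases but drops on the broad checks, so it is not promoted.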

Main remaining weak spots:

  • short fragments;
  • ask-clarification vs topic-selection boundaries;
  • continue-pending-task cases;
  • automated grader noise.

The repo includes the benchmark data, prompt variants, evaluation runner, full reports, and the v0.8 decision summary.

Repo: pneqnp-iswr/strict-intent-bench (github.com)

Benchmark and static demo for wrong intent inference in AI assistants, with Strict / Precision behavior as a tested intervention.

Live demo: strict-intent-bench (pneqnp-iswr.github.io)
