External Publication

Wrong intent inference benchmark v0.8: full English 80-case result

OpenAI Developer Community May 15, 2026

Follow-up to my earlier post about wrong intent inference.

I published a v0.8 update for strict-intent-bench, a benchmark-first repo focused on cases where an assistant gives a plausible answer, but to the wrong implied question.

The new result is an English v0.3 full 80-case run.

Metric	Baseline	Strict / Precision v13
Action accuracy	38.8%	66.2%
Wrong intent inference	37.5%	5.0%
Metadata unnecessary clarification	15.0%	11.2%
Missing needed clarification	30.0%	15.0%
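
For readers who want to see how rates like these can be derived from per-case grades, here is a minimal sketch. The action labels (`answer`, `ask_clarification`), the `intent_matches` field, and the metric definitions themselves are assumptions for illustration, not the repo's actual schema; the evaluation runner may define them differently.

```python
from dataclasses import dataclass


@dataclass
class GradedCase:
    # Hypothetical per-case grade; field names and labels are illustrative only.
    expected_action: str    # e.g. "answer" or "ask_clarification"
    predicted_action: str
    intent_matches: bool    # did the reply address the intended question?


def rates(cases):
    """Return each metric as a percentage of all cases, rounded to one decimal."""
    n = len(cases)

    def pct(count):
        return round(100 * count / n, 1)

    return {
        "action_accuracy": pct(
            sum(c.predicted_action == c.expected_action for c in cases)
        ),
        # Answered directly, but answered the wrong implied question.
        "wrong_intent_inference": pct(
            sum(c.predicted_action == "answer" and not c.intent_matches for c in cases)
        ),
        # Asked for clarification when none was needed.
        "unnecessary_clarification": pct(
            sum(
                c.predicted_action == "ask_clarification"
                and c.expected_action != "ask_clarification"
                for c in cases
            )
        ),
        # Should have asked for clarification but did not.
        "missing_needed_clarification": pct(
            sum(
                c.expected_action == "ask_clarification"
                and c.predicted_action != "ask_clarification"
                for c in cases
            )
        ),
    }


# Toy data, not taken from the benchmark.
toy_cases = [
    GradedCase("answer", "answer", True),
    GradedCase("answer", "answer", False),
    GradedCase("ask_clarification", "answer", False),
    GradedCase("answer", "ask_clarification", True),
]
toy_rates = rates(toy_cases)
```

Under these assumed definitions a single case can count toward more than one failure metric, so the rows of a table like the one above need not sum to 100%.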

The current best intervention is Strict / Precision v13.

The main result is not “strict prompting solves this.” It does not. The stronger claim is narrower:

A stricter action-selection policy can substantially reduce wrong intent inference while still exposing a real clarification trade-off.
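
To make the size of each change concrete, the sketch below computes the percentage-point difference between the two columns of the table above. The numbers are copied from that table; the `results` and `deltas` names are just for this example.

```python
# (baseline %, Strict / Precision v13 %) as published in the table above.
results = {
    "Action accuracy": (38.8, 66.2),
    "Wrong intent inference": (37.5, 5.0),
    "Metadata unnecessary clarification": (15.0, 11.2),
    "Missing needed clarification": (30.0, 15.0),
}


def deltas(table):
    """Percentage-point change from baseline to v13, rounded to one decimal."""
    return {
        metric: round(v13 - baseline, 1)
        for metric, (baseline, v13) in table.items()
    }


for metric, change in deltas(results).items():
    print(f"{metric}: {change:+.1f} points")
```

Read this way, wrong intent inference falls by 32.5 points, while v13 still leaves 15.0% of cases missing a needed clarification and 11.2% with an unnecessary one.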

I also tested later prompt variants. Some improved targeted failure cases, but regressed on broader checks, so v13 remains the current public champion.

Main remaining weak spots:

short fragments;
ask-clarification vs topic-selection boundaries;
continue-pending-task cases;
automated grader noise.

The repo includes the benchmark data, prompt variants, evaluation runner, full reports, and the v0.8 decision summary.

Repo:

github.com/pneqnp-iswr/strict-intent-bench

Benchmark and static demo for wrong intent inference in AI assistants, with Strict / Precision behavior as a tested intervention.

Live demo:

pneqnp-iswr.github.io (strict-intent-bench)
