Responses API: strict json_schema returns malformed JSON when combined with file_search + `include: file_search_call.results`
Summary
With text.format set to json_schema + strict: true, and tools including
file_search (a vector store) and web_search, and include containing
file_search_call.results, the Responses API intermittently returns malformed
JSON in output_text.text while reporting status: completed and
incomplete_details: null. Streaming and non-streaming both reproduce.
Failure rate measured against gpt-5.4-mini-2026-03-17:
| Configuration | Failure rate (n=20) |
|---|---|
| As below (baseline, the trigger config) | ~20% (3–4 / 20) |
| Same body without include: file_search_call.results | ~10% |
| Same body without tools | 0/20 |
| Same body without tool_choice: required (-> auto) | 0/20 (small N) |
The malformed output is not a truncation — strict mode would fail-closed on
truncation. It’s a structurally invalid sequence: the model emits one valid
"key":"value" pair, then a second value preceded by only a : (no comma, no
key for the second field). Every failure I observed has the same shape.
Symptom (verbatim from output_text.text)
{"headline_summary":"Apple’s most recent transcript in the files is its Q3 FY2025 earnings call, where management leaned hard on record services revenue, strong iPhone demand, and confidence in China; the stock now sits at $415.12, down 1.3":"Cautiously constructive: the narrative is upbeat, but the price action reads as incremental validation, not a euphoric rerating."}
Token sequence:
{"headline_summary":"<long string>":"<value>"}
The middle : is the failure — it should be ,"overall_sentiment": per the schema.
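To confirm that this shape is a hard parse error rather than a schema mismatch, a check with Python's standard json module is enough. The sketch below uses the placeholder token sequence from above, not the verbatim output, and the helper name try_parse is purely illustrative.

```python
import json

# Placeholder strings standing in for the real field values.
BROKEN = '{"headline_summary":"<long string>":"<value>"}'
EXPECTED = '{"headline_summary":"<long string>","overall_sentiment":"<value>"}'


def try_parse(text):
    """Return (parsed_object, None) on success or (None, error) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc


obj, err = try_parse(BROKEN)
if err is not None:
    # err.pos is the character offset where the parser gave up; for this
    # input it is the stray ':' that follows the first value.
    print(f"parse failed at offset {err.pos}: {err.msg}")
```

The same helper returns a two-key object for EXPECTED, which is what the schema calls for.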
The full response.completed event reports:
- status: "completed"
- incomplete_details: null
- text.format.strict: true
- The output message.content[0].text carries the broken JSON
- The annotations array on that OutputText has correct character indices into the (broken) text
Reproduction
Tested against https://api.openai.com/v1/responses with urllib.request
(Python 3.11) on macOS. The script reproduces the bug on a fresh, throwaway
vector store with a single 1-line markdown file. ~20% failure rate over 20
runs.
# Minimal request body that reproduces (extracted byte-for-byte from
# the call our app makes, then bisected). Replace VS_ID with a real
# vector store containing at least one indexed file.
{
  "model": "gpt-5.4-mini-2026-03-17",
  "stream": true,
  "input": [
    {"role": "developer", "type": "message",
     "content": "\n---\n\nYour entire response must be valid JSON matching this shape exactly. Use the field descriptions to decide what to put in each field.\n\nExample response:\n```json\n{\n \"headline_summary\": \"\",\n \"overall_sentiment\": \"\"\n}\n```\n\nField descriptions:\n- `headline_summary` (text)\n- `overall_sentiment` (text)\n"},
    {"role": "user", "type": "message",
     "content": "Use the file_search tool to find the most recent earnings call transcript matching the Ticker below. Pull out the most quotable management claim from the call. Then web-search the current stock price action since that call. Write a 4-sentence pithy take that contrasts narrative vs market reality. Cite the file_search source and one web URL.\n\nTicker: AAPL\n\nName: Apple Inc."}
  ],
  "include": ["file_search_call.results", "reasoning.encrypted_content"],
  "reasoning": {"effort": "none", "summary": "auto"},
  "text": {
    "format": {
      "type": "json_schema",
      "name": "ai_request_output",
      "strict": true,
      "schema": {
        "type": "object",
        "additionalProperties": false,
        "required": ["headline_summary", "overall_sentiment"],
        "properties": {
          "headline_summary": {"type": "string"},
          "overall_sentiment": {"type": "string"}
        }
      }
    }
  },
  "tool_choice": "required",
  "tools": [
    {"type": "file_search",
     "vector_store_ids": ["VS_ID"]},
    {"type": "web_search", "search_context_size": "medium"}
  ]
}
Run it ~20 times and parse the final output_text.text as JSON. Any
json.JSONDecodeError on a status: completed, incomplete_details: null
response is the bug.
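For anyone who wants to count failures rather than eyeball them, here is a minimal harness sketch. It is not the original script. It assumes the API key is in an OPENAI_API_KEY environment variable, forces stream to false to avoid SSE parsing, and assumes the non-streaming response carries the text in message items under output, each with output_text content parts. All function names are illustrative.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/responses"


def send(body):
    """POST one request and return the decoded response object (network call)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the key is supplied via this environment variable.
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def output_text(response):
    """Concatenate the output_text parts of every message item (assumed shape)."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


def is_bug(response):
    """True when a completed, non-truncated response fails to parse as JSON."""
    if response.get("status") != "completed":
        return False
    if response.get("incomplete_details") is not None:
        return False
    try:
        json.loads(output_text(response))
    except json.JSONDecodeError:
        return True
    return False


def count_failures(body, runs=20, send_fn=send):
    """Send the same body `runs` times and count malformed completions."""
    body = {**body, "stream": False}  # non-streaming keeps the sketch short
    return sum(is_bug(send_fn(body)) for _ in range(runs))
```

Calling count_failures(body) with the request body above (comment lines stripped, VS_ID replaced) gives the numerator for a 20-run sample. Passing a different send_fn allows a dry run without network access.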
Bisect (n=15–20 each)
Starting from the body above, single-variable changes:
| Change | Failure rate |
|---|---|
| baseline | 3/15 |
| remove include: file_search_call.results | 0/15 ✓ (n=15); 2/20 (n=20 retest) |
| remove include: reasoning.encrypted_content | 1/5 (small N) |
| remove include entirely | 0/5 |
| reasoning.summary: auto → detailed | 4/15 (no help) |
| remove reasoning entirely | 1/5 |
| remove store: null | 0/5 |
| tool_choice: required → auto | 0/5 |
| remove web_search | 0/5 |
| remove file_search | 0/5 |
| stream: true → false | did not test under same prompt; failure observed in both modes in separate runs |
The strongest single trigger is include: file_search_call.results — but
even removing it leaves a residual ~10% rate, so the trigger is the
combination, not a single field.
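The 0/5 rows deserve caution. Assuming independent runs with a fixed per-run failure probability p (an assumption this report does not establish), the chance of seeing zero failures purely by luck is (1 - p)^n. A short calculation, with an illustrative function name:

```python
def p_zero_failures(p, n):
    """Probability of observing 0 failures in n independent runs."""
    return (1.0 - p) ** n


# Rates roughly matching the baseline (~20%) and the residual (~10%).
for p in (0.20, 0.10):
    for n in (5, 15, 20):
        print(f"p={p:.2f} n={n:2d} -> {p_zero_failures(p, n):.3f}")
```

At a true 20% rate, five clean runs in a row happen about a third of the time, so a 0/5 row cannot rule a variable in or out on its own. The 0/15 result that later became 2/20 on retest is consistent with that.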
Expected behaviour
Under strict: true, the model’s output grammar should be enforced such that
any returned output_text.text for a status: completed /
incomplete_details: null response parses as valid JSON conforming to the
schema. Any failure to satisfy the schema should manifest as
incomplete_details.reason (e.g. max_output_tokens, content_filter),
not as malformed JSON in a “completed” response.
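Stated as a check, the expectation looks roughly like the sketch below. It hard-codes the two-string-field schema from the reproduction body instead of using a general JSON Schema validator, and the function names are illustrative.

```python
import json

REQUIRED = ("headline_summary", "overall_sentiment")


def conforms(text):
    """Check text against the two-string-field schema from the repro body."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # additionalProperties: false plus required means the key set is exact.
    if set(obj) != set(REQUIRED):
        return False
    return all(isinstance(obj[key], str) for key in REQUIRED)


def violates_expectation(status, incomplete_details, text):
    """True when a response claims clean completion but the text does not conform."""
    return (
        status == "completed"
        and incomplete_details is None
        and not conforms(text)
    )
```

Under the expected behaviour, violates_expectation should never return True for a strict json_schema request. The responses in this report are cases where it does.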
Workaround
In our app we now (1) skip include: file_search_call.results whenever
text.format is a strict json_schema, and (2) retry the call up to 3 times
on local JSON parse failure when strict structured output was requested.
Combined, this drops residual failures below 0.1%, but the trade-off is that
we lose per-file match details when structured outputs are also configured.
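A minimal sketch of those two steps, not our production code. The send_fn parameter is a hypothetical callable that takes a request body and returns the final output text, and max_attempts=3 reads "up to 3 times" as three attempts in total.

```python
import copy
import json


def is_strict_json_schema(body):
    """True when text.format requests a strict json_schema."""
    fmt = body.get("text", {}).get("format", {})
    return fmt.get("type") == "json_schema" and fmt.get("strict") is True


def prepare_body(body):
    """Drop file_search_call.results from include when strict output is requested."""
    if not is_strict_json_schema(body):
        return body
    out = copy.deepcopy(body)  # leave the caller's body untouched
    include = [v for v in out.get("include", []) if v != "file_search_call.results"]
    if include:
        out["include"] = include
    else:
        out.pop("include", None)
    return out


def call_with_retry(body, send_fn, max_attempts=3):
    """Return the first output text that parses as JSON, or raise the last error."""
    body = prepare_body(body)
    strict = is_strict_json_schema(body)
    last_error = None
    for _ in range(max_attempts):
        text = send_fn(body)
        if not strict:
            return text  # nothing to validate without structured output
        try:
            json.loads(text)
            return text
        except json.JSONDecodeError as exc:
            last_error = exc
    raise last_error
```

If attempts were independent at the residual ~10% rate, three failures in a row would occur about 0.1% of the time (0.1^3), which is in line with the figure above.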
Environment
- API: Responses API (POST /v1/responses)
- Model: gpt-5.4-mini-2026-03-17 (also seen on the unversioned gpt-5.4-mini)
- Streaming: bug reproduces both with and without stream: true
- Date observed: 2026-05-09