Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidvxuclhgxf3rb5bbipnk7uxz72l3uo5fsbi67zef374wccmcioca",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mli7nmhe72h2"
  },
  "path": "/t/responses-api-strict-json-schema-returns-malformed-json-when-combined-with-file-search-include-file-search-call-results/1380608#post_1",
  "publishedAt": "2026-05-10T07:13:35.000Z",
  "site": "https://community.openai.com",
  "textContent": "## Summary\n\nWith `text.format` set to `json_schema` + `strict: true`, and `tools` including\n`file_search` (a vector store) and `web_search`, and `include` containing\n`file_search_call.results`, the Responses API intermittently returns malformed\nJSON in `output_text.text` while reporting `status: completed` and\n`incomplete_details: null`. Streaming and non-streaming both reproduce.\n\nFailure rate measured against `gpt-5.4-mini-2026-03-17`:\n\nConfiguration | Failure rate (n=20)\n---|---\nAs below (baseline, the trigger config) | ~20% (3–4 / 20)\nSame body without `include: file_search_call.results` | ~10%\nSame body without `tools` | 0/20\nSame body without `tool_choice: required` (-> `auto`) | 0/20 (small N)\n\nThe malformed output is _not_ a truncation — strict mode would fail-closed on\ntruncation. It’s a **structurally invalid** sequence: the model emits one valid\n`\"key\":\"value\"` pair, then a second value preceded by only a `:` (no comma, no\nkey for the second field). 
Every failure I observed has the same shape.\n\n## Symptom (verbatim from `output_text.text`)\n\n\n    {\"headline_summary\":\"Apple’s most recent transcript in the files is its Q3 FY2025 earnings call, where management leaned hard on record services revenue, strong iPhone demand, and confidence in China; the stock now sits at $415.12, down 1.3\":\"Cautiously constructive: the narrative is upbeat, but the price action reads as incremental validation, not a euphoric rerating.\"}\n\n\nToken sequence:\n\n  * `{` `\"headline_summary\"` `:` `\"<long string>\"` **`:`** `\"<value>\"` `}`\n\n\n\nThe middle `:` is the failure — it should be `,\"overall_sentiment\":` per the schema.\n\nThe full `response.completed` event reports:\n\n  * `status: \"completed\"`\n  * `incomplete_details: null`\n  * `text.format.strict: true`\n  * The output `message.content[0].text` carries the broken JSON\n  * The annotations array on that `OutputText` has correct character indices into\nthe (broken) text\n\n\n\n## Reproduction\n\nTested against `https://api.openai.com/v1/responses` with `urllib.request`\n(Python 3.11) on macOS. The script reproduces the bug on a fresh, throwaway\nvector store with a single 1-line markdown file. ~20% failure rate over 20\nruns.\n\n\n    # Minimal request body that reproduces (extracted byte-for-byte from\n    # the call our app makes, then bisected). Replace VS_ID with a real\n    # vector store containing at least one indexed file.\n    {\n      \"model\": \"gpt-5.4-mini-2026-03-17\",\n      \"stream\": true,\n      \"input\": [\n        {\"role\": \"developer\", \"type\": \"message\",\n         \"content\": \"\\n---\\n\\nYour entire response must be valid JSON matching this shape exactly. 
Use the field descriptions to decide what to put in each field.\\n\\nExample response:\\n```json\\n{\\n  \\\"headline_summary\\\": \\\"\\\",\\n  \\\"overall_sentiment\\\": \\\"\\\"\\n}\\n```\\n\\nField descriptions:\\n- `headline_summary` (text)\\n- `overall_sentiment` (text)\\n\"},\n        {\"role\": \"user\", \"type\": \"message\",\n         \"content\": \"Use the file_search tool to find the most recent earnings call transcript matching the Ticker below. Pull out the most quotable management claim from the call. Then web-search the current stock price action since that call. Write a 4-sentence pithy take that contrasts narrative vs market reality. Cite the file_search source and one web URL.\\n\\nTicker: AAPL\\n\\nName: Apple Inc.\"}\n      ],\n      \"include\": [\"file_search_call.results\", \"reasoning.encrypted_content\"],\n      \"reasoning\": {\"effort\": \"none\", \"summary\": \"auto\"},\n      \"text\": {\n        \"format\": {\n          \"type\": \"json_schema\",\n          \"name\": \"ai_request_output\",\n          \"strict\": true,\n          \"schema\": {\n            \"type\": \"object\",\n            \"additionalProperties\": false,\n            \"required\": [\"headline_summary\", \"overall_sentiment\"],\n            \"properties\": {\n              \"headline_summary\":  {\"type\": \"string\"},\n              \"overall_sentiment\": {\"type\": \"string\"}\n            }\n          }\n        }\n      },\n      \"tool_choice\": \"required\",\n      \"tools\": [\n        {\"type\": \"file_search\",\n         \"vector_store_ids\": [\"VS_ID\"]},\n        {\"type\": \"web_search\", \"search_context_size\": \"medium\"}\n      ]\n    }\n\n\nRun it ~20 times and parse the final `output_text.text` as JSON. 
Any\n`json.JSONDecodeError` on a `status: completed`, `incomplete_details: null`\nresponse is the bug.\n\n## Bisect (n=15–20 each)\n\nStarting from the body above, single-variable changes:\n\nChange | Failure rate\n---|---\nbaseline | 3/15\nremove `include: file_search_call.results` | 0/15 ✓ (n=15); 2/20 (n=20 retest)\nremove `include: reasoning.encrypted_content` | 1/5 (small N)\nremove `include` entirely | 0/5\n`reasoning.summary: auto` → `detailed` | 4/15 (no help)\nremove `reasoning` entirely | 1/5\nremove `store: null` | 0/5\n`tool_choice: required` → `auto` | 0/5\nremove `web_search` | 0/5\nremove `file_search` | 0/5\n`stream: true` → `false` | did not test under same prompt; failure observed in both modes in separate runs\n\nThe strongest single trigger is **`include: file_search_call.results`** — but\neven removing it leaves a residual ~10% rate, so the trigger is the\ncombination, not a single field.\n\n## Expected behaviour\n\nUnder `strict: true`, the model’s output grammar should be enforced such that\nany returned `output_text.text` for a `status: completed` /\n`incomplete_details: null` response parses as valid JSON conforming to the\nschema. Any failure to satisfy the schema should manifest as\n`incomplete_details.reason` (e.g. 
`max_output_tokens`, `content_filter`),\nnot as malformed JSON in a “completed” response.\n\n## Workaround\n\nIn our app we now (1) skip `include: file_search_call.results` whenever\n`text.format` is a strict `json_schema`, and (2) retry the call up to 3 times\non local JSON parse failure when strict structured output was requested.\nCombined this drops residual failures below 0.1%, but the trade-off is that\nwe lose per-file match details when structured outputs are also configured.\n\n## Environment\n\n  * API: Responses API (`POST /v1/responses`)\n  * Model: `gpt-5.4-mini-2026-03-17` (also seen on the unversioned `gpt-5.4-mini`)\n  * Streaming: bug reproduces both with and without `stream: true`\n  * Date observed: 2026-05-09\n\n",
  "title": "Responses API: strict json_schema returns malformed JSON when combined with file_search + `include: file_search_call.results`"
}
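
The workaround in the record's `textContent` (retry up to 3 times on local JSON parse failure when strict structured output was requested) could be sketched roughly as below. This is a minimal illustration, not the poster's actual code. `parse_strict_output`, `call_with_retry`, and the `fetch_output_text` callable are hypothetical names; the callable stands in for whatever code issues the `POST /v1/responses` request and returns the final `output_text.text`. The sketch also assumes "up to 3 times" means three attempts in total, and it checks only the two-string-field shape from the schema in the report.

```python
import json
from typing import Callable

# Keys taken from the schema in the report; everything else here is illustrative.
REQUIRED_KEYS = ("headline_summary", "overall_sentiment")


def parse_strict_output(text: str, required_keys=REQUIRED_KEYS) -> dict:
    """Parse the output text and check it against the two-string-field shape.

    Raises ValueError on malformed JSON or on a shape mismatch.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != set(required_keys):
        raise ValueError("JSON parsed but does not match the expected keys")
    if not all(isinstance(data[key], str) for key in required_keys):
        raise ValueError("expected every field to be a string")
    return data


def call_with_retry(fetch_output_text: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the (hypothetical) request function until its output parses.

    Assumes three attempts in total; the report does not say whether the
    first call counts toward the limit.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_strict_output(fetch_output_text())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Validating locally rather than trusting `status: completed` matters here because, per the report, the malformed responses carry `incomplete_details: null` and are otherwise indistinguishable from successful ones.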