{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreierdtbbmgiqmjbpfko6bunfre5h4kgt3iubadt3lhyiw7arp73u5m",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mhf47gvvbqo2"
},
"path": "/t/evals-api-file-search-return-value-always-empty-no-way-to-see-retrieval-results/1377132#post_1",
"publishedAt": "2026-03-19T02:07:22.000Z",
"site": "https://community.openai.com",
"textContent": "## Overview\n\nI’m running evals with `file_search` using the Responses data source type. The `return_value` on every file_search tool call is always `\"File search results: \"` (21 characters, empty) — for every item, every model (tested gpt-5.4 and gpt-4.1), whether retrieval clearly succeeded (10K+ prompt tokens, file citations present) or clearly failed (3K prompt tokens, no citations). Checked all three surfaces: API output items, dashboard JSONL export, and dashboard UI. All identical.\n\nThe Responses API supports `include=[\"output[*].file_search_call.search_results\"]` to get this data. Tried passing it through the Evals API at `sampling_params.include`, `data_source.include`, and top-level `include` — all rejected with “Unknown parameter.” The `available_includes` field on output items is always `[]`.\n\nThis makes it very difficult to debug eval failures. I can’t tell whether file_search executed, what it returned, or why a question failed — retrieval issue vs model issue.\n\n## Environment\n\n * Models tested: `gpt-5.4`, `gpt-4.1`\n\n * Eval data source type: `responses` with `file_search` tool\n\n * Vector store: 100+ indexed markdown files, all status `completed`\n\n * Reproduced via direct API calls (not SDK-specific)\n\n\n\n\n## Reproduction\n\n 1. Create an eval with a `responses` data source that includes `file_search` as a tool\n\n 2. Run the eval\n\n 3. Check the results through any of these three methods:\n\n\n\n\n**Method 1 — API:**\n\nFetch output items via `GET /evals/{eval_id}/runs/{run_id}/output_items`. Check `sample.output[].tool_calls[].function.return_value`.\n\n**Method 2 — Dashboard JSONL export:**\n\nExport eval items from the dashboard. 
Check `sample.outputs[].tool_calls[].function.return_value`.\n\n**Method 3 — Dashboard UI:**\n\nView the eval run results in the browser.\n\n**All three methods return the same thing for every item:**\n\n\n \"return_value\": \"File search results: \"\n\n\n\nThis is identical for:\n\n * Items that clearly retrieved documents (10,000+ prompt tokens, file citations in response)\n\n * Items that retrieved nothing (3,000 prompt tokens, no citations, model says “I don’t have documentation”)\n\n * Both `gpt-5.4` and `gpt-4.1`\n\n\n\n\n## What I expected\n\nI expected the `return_value` field to contain the file search results (filenames, chunks, scores) — similar to how the Responses API returns them when you set `include=[\"output[*].file_search_call.search_results\"]`.\n\n## What I tried\n\n**Passing `include` through the Evals API:**\n\nI attempted to pass the `include` parameter at three levels when creating the eval run. All three were explicitly rejected:\n\n\n \"data_source.sampling_params.include\": [\"output[*].file_search_call.search_results\"]\n\n → \"Unknown parameter: 'data_source.sampling_params.include'\"\n\n \"data_source.include\": [\"output[*].file_search_call.search_results\"]\n\n → \"Unknown parameter: 'data_source.include'\"\n\n \"include\": [\"output[*].file_search_call.search_results\"]\n\n → \"Unknown parameter: 'include'\"\n\n\n\nThe Evals API does not accept the `include` parameter that the Responses API supports.\n\n**Checking the dashboard export:**\n\nThe dashboard JSONL export uses a different structure (`trajectory`/`outputs` vs the API’s `output`) but contains the same empty `return_value` on every item. No additional file_search data is present anywhere in the export.\n\n**Checking the `available_includes` field:**\n\nEach output item returned from the API has an `available_includes` field. 
It is always an empty array: `\"available_includes\": []`.\n\n## Why this matters\n\nWithout seeing file_search results, I cannot determine:\n\n * Whether file_search actually executed or silently failed\n\n * Whether it returned relevant documents that the model ignored\n\n * Whether the score threshold filtered out results that were close matches\n\n * What similarity scores the retrieved chunks had\n\n * Whether a test failure was caused by bad retrieval vs bad model reasoning\n\n\n\n\nThe only indirect signal available is `prompt_tokens` count — items with ~10,000 tokens likely had chunks injected into context, items with ~3,000 tokens (just the system prompt) likely received nothing. But this is a token count heuristic, not a direct observation of retrieval behavior.",
"title": "Evals API: file_search return_value always empty — no way to see retrieval results"
}