Evals API: file_search return_value always empty — no way to see retrieval results
Overview
I’m running evals with file_search using the Responses data source type. The return_value on every file_search tool call is always "File search results: " (21 characters, empty) — for every item, every model (tested gpt-5.4 and gpt-4.1), whether retrieval clearly succeeded (10K+ prompt tokens, file citations present) or clearly failed (3K prompt tokens, no citations). Checked all three surfaces: API output items, dashboard JSONL export, and dashboard UI. All identical.
The Responses API supports include=["output[*].file_search_call.search_results"] to get this data. Tried passing it through the Evals API at sampling_params.include, data_source.include, and top-level include — all rejected with “Unknown parameter.” The available_includes field on output items is always [].
This makes it very difficult to debug eval failures. I can’t tell whether file_search executed, what it returned, or why a question failed — retrieval issue vs model issue.
Environment
Models tested:
gpt-5.4,gpt-4.1Eval data source type:
responseswithfile_searchtoolVector store: 100+ indexed markdown files, all status
completedReproduced via direct API calls (not SDK-specific)
Reproduction
Create an eval with a
responsesdata source that includesfile_searchas a toolRun the eval
Check the results through any of these three methods:
Method 1 — API:
Fetch output items via GET /evals/{eval_id}/runs/{run_id}/output_items. Check sample.output[].tool_calls[].function.return_value.
Method 2 — Dashboard JSONL export:
Export eval items from the dashboard. Check sample.outputs[].tool_calls[].function.return_value.
Method 3 — Dashboard UI:
View the eval run results in the browser.
All three methods return the same thing for every item:
"return_value": "File search results: "
This is identical for:
Items that clearly retrieved documents (10,000+ prompt tokens, file citations in response)
Items that retrieved nothing (3,000 prompt tokens, no citations, model says “I don’t have documentation”)
Both
gpt-5.4andgpt-4.1
What I expected
I expected the return_value field to contain the file search results (filenames, chunks, scores) — similar to how the Responses API returns them when you set include=["output[*].file_search_call.search_results"].
What I tried
Passinginclude through the Evals API:
I attempted to pass the include parameter at three levels when creating the eval run. All three were explicitly rejected:
"data_source.sampling_params.include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'data_source.sampling_params.include'"
"data_source.include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'data_source.include'"
"include": ["output[*].file_search_call.search_results"]
→ "Unknown parameter: 'include'"
The Evals API does not accept the include parameter that the Responses API supports.
Checking the dashboard export:
The dashboard JSONL export uses a different structure (trajectory/outputs vs the API’s output) but contains the same empty return_value on every item. No additional file_search data is present anywhere in the export.
Checking theavailable_includes field:
Each output item returned from the API has an available_includes field. It is always an empty array: "available_includes": [].
Why this matters
Without seeing file_search results, I cannot determine:
Whether file_search actually executed or silently failed
Whether it returned relevant documents that the model ignored
Whether the score threshold filtered out results that were close matches
What similarity scores the retrieved chunks had
Whether a test failure was caused by bad retrieval vs bad model reasoning
The only indirect signal available is prompt_tokens count — items with ~10,000 tokens likely had chunks injected into context, items with ~3,000 tokens (just the system prompt) likely received nothing. But this is a token count heuristic, not a direct observation of retrieval behavior.
Discussion in the ATmosphere