Raw Record Source

{
  "path": "/posts/2025/gemini-hidden-reasoning/index",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "$type": "site.standard.document",
  "title": "Gemini Hidden Reasoning",
  "updatedAt": "2025-12-30T23:06:56.059Z",
  "description": "The performance costs of thinking and provider defaults",
  "publishedAt": "2025-07-07T01:15:12.000Z",
  "textContent": "While building Tomo and several prototypes using LLMs, I've experimented with several popular language models.\nIt's generally easier to prototype using the OpenAI chat responses API because most providers support this early API spec (mostly).\nThis approach makes it pretty simple to switch between models and providers by changing model, base_url, and api_key.\n\nAfter watching tokens stream for the umpteenth time while testing, I started to get different vibes from different models.\n\nSome thoughts that crossed my mind:\n\n- Why is Anthropic's streaming so choppy?\n- Wow, OpenAI's streaming is so smooth!\n- Why did the time to first token for Gemini get so much longer from gemini-2.0-flash to gemini-2.5-flash?\n\nToday is the day we find out what these differences are and if my preconceived notions are actually backed up by experimentation.\n\nNaive streaming with different models\n\nLet's write some code to stream inference of two models (one older, one newer) from Anthropic, Gemini, and OpenAI.\n\nThe output is a lot so I collapsed it.\nI'm going to aggregate some stats across several runs later, but you can see a few things even before we do that I want to take a look.\n\n- The tokens streamed by OpenAI are smaller than those of the other providers, and there is very little latency between tokens. This likely makes the streaming feel smooth.\n- Time to first token increases significantly from gemini-1.5-flash to gemini-2.5-flash. We'll come back to that.\n- Gemini's streamed chunks are larger than those of the other providers.\n- Anthropic's chunk size and time between chunks are a bit variable. 
Gemini's chunks are large enough (and the total inference output for this prompt small enough) that it's hard to tell how it compares off the bat.\n\nEnough qualitative stuff.\n\nLet's run some stats.\n\nI changed the prompt to elicit a longer inference response.\n\n> Write a story about a robot.\n\nWe're going to do five runs of each model and measure the time to first token, chunk latency (between chunks 2-3, 3-4, etc.), and chunk size in characters.\n\nThis took a little while to run and I could have parallelized.\nFortunately, the first run got the stats I was looking to see.\n\nAggregate Statistics\n\n| Model                      | TTFT (mean/stdev) | Chunk Latency (mean/stdev) | Chunk Size (mean/stdev) |\n| -------------------------- | ----------------- | -------------------------- | ----------------------- |\n| gpt-3.5-turbo              | 0.676s / 0.480s   | 0.005s / 0.012s            | 5.023 / 2.689           |\n| gpt-4o                     | 1.690s / 2.292s   | 0.019s / 0.051s            | 4.799 / 2.713           |\n| gemini-1.5-flash           | 0.466s / 0.075s   | 0.293s / 0.152s            | 204.328 / 112.207       |\n| gemini-2.5-flash           | 9.763s / 0.921s   | 0.358s / 0.113s            | 219.664 / 36.831        |\n| claude-3-5-sonnet-20240620 | 0.755s / 0.208s   | 0.051s / 0.048s            | 13.322 / 7.398          |\n| claude-sonnet-4-20250514   | 1.375s / 0.158s   | 0.408s / 0.205s            | 56.379 / 18.561         |\n\n!Plot of Claude, Gemini, and GPT models' time to first token, chunk latency, and chunk size\n\nI wasn't too surprised by the results, having seen the raw stats from the first run. The providers have different chunk sizes, and latency between chunks increases roughly proportionally to the chunk size. 
I didn't expect Anthropic to be as competitive with OpenAI in terms of time to first token.\n\nOne thing that stands out a lot is gemini-2.5-flash's time to first token.\n\nDifferent approaches to reasoning\n\nWhat is going on here?\nThis latency clearly isn't a one-off since we did five runs.\n\nIn the Gemini documentation on how to call the Gemini models using the OpenAI API, it says:\n\n> Unlike the Gemini API, the OpenAI API offers three levels of thinking control: \"low\", \"medium\", and \"high\", which map to 1,024, 8,192, and 24,576 tokens, respectively.\n\n> If you want to disable thinking, you can set reasoning_effort to \"none\" (note that reasoning cannot be turned off for 2.5 Pro models).\n\nI actually didn't realize gemini-2.5-flash was a reasoning model.\nIs it possible the reasoning tokens are being held back from the stream by default?\nIs reasoning \"on\" by default for gemini-2.5-flash?\n\nHow does Anthropic handle reasoning for Sonnet 4 with the OpenAI API?\n\nIt turns out the two providers offer different controls for reasoning.\n\nAnthropic calls it \"extended thinking support\".\n\nGoogle calls it \"thinking\" and \"reasoning effort\".\n\nSo let's play with those.\nI was curious to see whether and how they're supported in streaming mode.\n\nStarting with Claude, I tried with claude-3-5-sonnet-20240620.\n\nMakes sense.\nThis isn't a reasoning model.\n\nMoving to claude-sonnet-4-20250514, for various token budget sizes\n\nThe chart below contains the aggregated stats for five runs of claude-sonnet-4-20250514 with different thinking modes.\nThe suffixes denote the thinking token budget given to Claude for the runs that were aggregated in the plot.\nNo suffix indicates the default reasoning behavior of the model.\n\n!An image comparing the time to first token, chunk latency, and chunk size for the thinking modes for claude-sonnet-4-20250514\n\nWith a thinking token budget, the time to first token is higher.\nInspecting the response chunks, we don't see 
any of the thinking.\nIt turns out, using the OpenAI Python client, we can't see the thinking:\n\n> The OpenAI SDK won't return Claude's detailed thought process\n> https://docs.anthropic.com/en/api/openai-sdk#extended-thinking-support\n\nSo, the time to first token is higher when thinking, and that seems to be relatively constant across the specified thinking token budgets.\n\nWith Gemini, I tried with gemini-1.5-flash and got this error\n\nWith gemini-2.5-flash, I first tried with reasoning_effort=\"low\".\nTime to first token was still a few seconds.\nThe reasoning tokens did not stream.\n\nSwitching to the \"thinking\" approach (with a budget of a lot of tokens)\n\non inspection of the response coming back from the model, we see text like\n\nSo when we include_thoughts, this is what they look like.\nAlso, when we include thoughts, the time to the first token goes down, which makes sense because otherwise, we're waiting for the model to think but not \"seeing\" it think.\nWe can further validate this by setting thinking_budget to 0 and reasoning_effort to \"none\".\n\nThe chart below contains the aggregated stats for five runs of the gemini-2.5-flash model with different thinking modes.\nThe numeric suffixes denote the thinking token budget, while the others indicate the reasoning effort given to Gemini for the runs aggregated in the plot.\nNo suffixes indicate the default reasoning behavior of the model.\n\n!An image comparing the time to first token, chunk latency, and chunk size for the thinking modes for gemini-2.5-flash\n\nWhen reasoning is disabled, either by setting a token budget of 0 (resulting in a time to first token of 0.40s) or by using a reasoning effort of \"none\" (resulting in 0.43s), the time to the first token closely matches that of gemini-1.5-flash, which is 0.46s, seen in aggregate stats.\n\nGemini gives us a way to specify how much thinking we want the model to do, or we can give the model a specific thinking token budget.\nGiven the time 
to the first token for thinking budgets of 1,000 and 10,000 is nearly identical (1.56s and 1.57s respectively), it seems plausible that the model uses its own judgment to figure out how much thinking is needed.\nThis approach seems to align with how Claude implements thinking, though Anthropic doesn't give us a way to require the model to think more than _it thinks_ it needs to, the way Gemini does with reasoning_effort.\n\nTakeaways\n\nI started this investigation wanting to understand why the time to first token for gemini-2.5-flash was so much higher than for the other models and prior versions of Gemini.\nIt turns out gemini-2.5-flash does thinking (probably as reasoning_effort=\"medium\" or reasoning_effort=\"high\") by default but hides it.\nHowever, this thinking can be disabled if you explicitly ask for that, either with a thinking budget of 0 or with reasoning_effort=\"none\".\n\nThe investigation suggests that the straightforward use of the OpenAI chat completions API across different providers is nearing its end, especially for applications needing more than just text and image input support. With provider-specific features like PDF support, varied reasoning methods, citations, and OpenAI's shift away from the chat completions API for newer models, the hope for a unified API standard for LLMs is fading.\n\nI am heartened to see OpenRouter serving as the layer of consistency across models and providers.\nI hope they're able to continue to make this work and would love to see more of a push for an independent API standard.\nLiteLLM is also a great project that is continuing to make this work where they can.\n\nMany LLM APIs now also contain nontrivial product features.\nWhat could have been some sort of protocol has morphed into competing products with competing visions.",
  "canonicalUrl": "https://www.danielcorin.com/posts/2025/gemini-hidden-reasoning/index"
}