Are There Any Good Benchmarks Comparing OpenAI API Models?
Thanks @PaulBellow
A note up front: I am not a mathematician, just someone on this forum with some familiarity with the question.
bar1s:
I’m looking for benchmark results that compare OpenAI models specifically on mathematical reasoning.
Without more specifics, such as the area of mathematics, the level of mathematics, and the intended goal — for example, solving problems, proving results, formalizing proofs, or assisting with exploration — this is difficult to answer in a general way.
Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.
bar1s:
Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?
Two mathematicians who have occasionally written about using AI are:
- Timothy Gowers
  Gowers's Weblog (mathematics-related discussions): “A recent experience with ChatGPT 5.5 Pro”
- Terence Tao
  What's new (updates on my research and expository papers, discussion of open problems, and other maths-related topics, by Terence Tao), 9 Dec 25: “The story of Erdős problem #1026”
  Excerpt: “Problem 1026 on the Erdős problem web site recently got solved through an interesting combination of existing literature, online collaboration, and AI tools. The purpose of this blog post is to try…”
For something very recent and closer to the Lean/formalization side, see:
Jacobian challenge
That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.