Are There Any Good Benchmarks Comparing OpenAI API Models?

OpenAI Developer Community June 19, 2026

EricGT:

Without more specifics, such as the area of mathematics, the level of mathematics, and the intended goal — for example, solving problems, proving results, formalizing proofs, or assisting with exploration — this is difficult to answer in a general way.

Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.

bar1s:

Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?

Two mathematicians who have occasionally written about using AI are:

Timothy Gowers

https://gowers.wordpress.com/

A recent experience with AI Automation

Terence Tao

https://terrytao.wordpress.com/

The story of AI Integration

For something very recent and closer to the Lean/formalization side, see:

Jacobian challenge

That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.

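As for my own experience with math-heavy workloads, what has worked is pairing the LLM with a tool such as Lean 4, or even just Python’s SymPy via Code Interpreter. In my experience, standard LLMs on their own tend to fail at raw calculation, so I let the model set up the problem and let the tool do the arithmetic or check the proof.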
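To make the SymPy route concrete, here is a minimal sketch of the kind of script a model might hand off to a Python sandbox instead of computing in its head. It only assumes SymPy is installed; the helper name `verify_antiderivative` and the sample problems are my own illustrations, not something taken from Code Interpreter or from the threads linked above.

```python
from sympy import Rational, cos, diff, factorint, integrate, simplify, symbols

x = symbols("x")


def verify_antiderivative(candidate, integrand, var) -> bool:
    """Hypothetical helper: check a claimed antiderivative by differentiating it."""
    return simplify(diff(candidate, var) - integrand) == 0


# Exact integer factorization, where an LLM would otherwise be guessing digits.
factors = factorint(2**32 + 1)

# Exact rational arithmetic, with no floating-point rounding.
total = Rational(1, 3) + Rational(1, 6)

# Symbolic integration, followed by an independent check of the result.
integrand = x * cos(x)
antiderivative = integrate(integrand, x)
is_correct = verify_antiderivative(antiderivative, integrand, x)

print(factors)
print(total)
print(antiderivative, is_correct)
```

The point of the split is that the model only has to state the problem correctly; the exact arithmetic and the differentiate-to-check step are done by SymPy, so a wrong answer shows up as a failed check rather than a plausible-looking number.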
