Are There Any Good Benchmarks Comparing OpenAI API Models?

OpenAI Developer Community June 17, 2026

Thanks @PaulBellow

A note up front: I am not a mathematician, just someone on this forum with some familiarity with the question.

bar1s:

I’m looking for benchmark results that compare OpenAI models specifically on mathematical reasoning.

This is difficult to answer in general without more specifics: the area of mathematics, the level, and the intended goal (for example, solving problems, proving results, formalizing proofs, or assisting with exploration).

Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.
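
To make that concrete, here is a minimal sketch in Python, assuming you have already graded a model's answers yourself. The category names and the results are made up for illustration, and `accuracy_by_category` is a hypothetical helper, not part of any benchmark suite. It only shows how a single overall score can hide large differences between task types.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return {category: fraction correct} for (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        if is_correct:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


# Hypothetical graded results, invented purely for illustration.
sample = [
    ("competition problems", True),
    ("competition problems", True),
    ("proof writing", False),
    ("proof writing", True),
    ("Lean formalization", False),
]

per_category = accuracy_by_category(sample)
overall = sum(is_correct for _, is_correct in sample) / len(sample)

print("overall:", overall)
for category, accuracy in per_category.items():
    print(category, accuracy)
```

With this invented data, the overall figure sits in the middle while the per-category figures range from all correct to none correct, which is why the task type and setup matter when reading any benchmark number.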

bar1s:

Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?

Two mathematicians who have occasionally written about using AI are:

Timothy Gowers

Gowers's Weblog

Mathematics related discussions

A recent experience with ChatGPT 5.5 Pro

Terence Tao

What's new

Updates on my research and expository papers, discussion of open problems, and other maths-related topics. By Terence Tao

What's new – 9 Dec 25

The story of Erdős problem #1026

Problem 1026 on the Erdős problem web site recently got solved through an interesting combination of existing literature, online collaboration, and AI tools. The purpose of this blog post is to try…

For something very recent and closer to the Lean/formalization side, see:

Jacobian challenge

That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.
