Are There Any Good Benchmarks Comparing OpenAI API Models?
EricGT:
Without more specifics, such as the area of mathematics, the level of mathematics, and the intended goal — for example, solving problems, proving results, formalizing proofs, or assisting with exploration — this is difficult to answer in a general way.
Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.
bar1s:
Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?
Two mathematicians who have occasionally written about using AI are:
- Timothy Gowers
A recent experience with AI Automation
- Terence Tao
https://terrytao.wordpress.com/ The story of AI Integration
For something very recent and closer to the Lean/formalization side, see:
Jacobian challenge
That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.
In my personal experience with math workloads, what works is pairing the LLM with a tool such as Lean 4, or even just Python’s SymPy via Code Interpreter. I think standard LLMs on their own fail at raw calculation.
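As a rough illustration of the SymPy side of that pairing, here is a minimal sketch of checking a model's answer symbolically instead of trusting its arithmetic. The helper name `check_antiderivative` and the two "claimed" answers are hypothetical stand-ins for model output, not anything taken from a real session.

```python
import sympy as sp

x = sp.symbols("x")

def check_antiderivative(integrand, claimed):
    """Return True if the derivative of `claimed` equals `integrand`.

    Hypothetical helper: differentiates the claimed antiderivative and
    asks SymPy whether the difference simplifies to zero.
    """
    return sp.simplify(sp.diff(claimed, x) - integrand) == 0

# Problem posed to the model: integrate x * exp(x).
integrand = x * sp.exp(x)

# Two made-up model answers (assumptions for illustration only).
claimed_ok = (x - 1) * sp.exp(x)   # correct antiderivative
claimed_bad = x * sp.exp(x)        # plausible-looking but wrong

ok_result = check_antiderivative(integrand, claimed_ok)
bad_result = check_antiderivative(integrand, claimed_bad)

# Exact integer arithmetic, the kind of raw calculation left to the tool.
product = sp.Integer(123456789) * sp.Integer(987654321)

print(ok_result, bad_result, product)
```

The point of the design is that the LLM proposes and the computer algebra system verifies. A check like this only covers what SymPy can simplify, so a failed check on a harder expression means "not confirmed" and not necessarily "wrong".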