{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidriylixb4ylxyog52vk3fxmxteug3jpfi4h6gwxkpmjhsbd3syqq",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3monhzuse7z42"
},
"path": "/t/are-there-any-good-benchmarks-comparing-openai-api-models/1383961#post_4",
"publishedAt": "2026-06-19T13:15:01.000Z",
"site": "https://community.openai.com",
"tags": [
"https://gowers.wordpress.com/",
"A recent experience with AI Automation",
"https://terrytao.wordpress.com/",
"The story of AI Integration",
"Jacobian challenge"
],
"textContent": "EricGT:\n\n> Without more specifics, such as the area of mathematics, the level of mathematics, and the intended goal — for example, solving problems, proving results, formalizing proofs, or assisting with exploration — this is difficult to answer in a general way.\n>\n> Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.\n>\n> * * *\n>\n> bar1s:\n>\n>> Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?\n>\n> Two mathematicians who have occasionally written about using AI are:\n>\n> * Timothy Gowers\n>\n\n>\n> https://gowers.wordpress.com/\n>\n> **A recent experience with AI Automation**\n>\n> * Terence Tao\n>\n\n>\n> https://terrytao.wordpress.com/ The story of AI Integration\n>\n> * * *\n>\n> For something very recent and closer to the Lean/formalization side, see:\n>\n> **Jacobian challenge**\n>\n> That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.\n\nRegarding my personal experience with math workloads, the answer is putting together the LLM with something like Lean 4 or even just Python’s SymPy via Code Interpreter. I think standard LLMs fail at raw calculation",
"title": "Are There Any Good Benchmarks Comparing OpenAI API Models?"
}