Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic3pnipcxdr4quxbzv3iwwpeqsvb7htekn4j5j7gaxcyh5iwssb6q",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mohzemqnoyz2"
  },
  "path": "/t/are-there-any-good-benchmarks-comparing-openai-api-models/1383961#post_3",
  "publishedAt": "2026-06-17T09:07:12.000Z",
  "site": "https://community.openai.com",
  "tags": [
    "@PaulBellow",
    "Gowers's Weblog",
    "A recent experience with ChatGPT 5.5 Pro",
    "What's new",
    "What's new – 9 Dec 25",
    "The story of Erdős problem #1026",
    "Jacobian challenge"
  ],
  "textContent": "Thanks @PaulBellow\n\nA note up front: I am not a mathematician, just someone on this forum with some familiarity with the question.\n\nbar1s:\n\n> I’m looking for benchmark results that compare OpenAI models specifically on mathematical reasoning.\n\nWithout more specifics, such as the area of mathematics, the level of mathematics, and the intended goal (for example, solving problems, proving results, formalizing proofs, or assisting with exploration), this is difficult to answer in a general way.\n\nEven with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.\n\n* * *\n\nbar1s:\n\n> Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?\n\nTwo mathematicians who have occasionally written about using AI are:\n\n  * Timothy Gowers\n\n### Gowers's Weblog\n\nMathematics related discussions\n\n**A recent experience with ChatGPT 5.5 Pro**\n\n  * Terence Tao\n\n### What's new\n\nUpdates on my research and expository papers, discussion of open problems, and other maths-related topics. By Terence Tao\n\nWhat's new – 9 Dec 25\n\n### The story of Erdős problem #1026\n\nProblem 1026 on the Erdős problem web site recently got solved through an interesting combination of existing literature, online collaboration, and AI tools. The purpose of this blog post is to try…\n\n* * *\n\nFor something very recent and closer to the Lean/formalization side, see:\n\n**Jacobian challenge**\n\nThat thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.",
  "title": "Are There Any Good Benchmarks Comparing OpenAI API Models?"
}