Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidriylixb4ylxyog52vk3fxmxteug3jpfi4h6gwxkpmjhsbd3syqq",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3monhzuse7z42"
  },
  "path": "/t/are-there-any-good-benchmarks-comparing-openai-api-models/1383961#post_4",
  "publishedAt": "2026-06-19T13:15:01.000Z",
  "site": "https://community.openai.com",
  "tags": [
    "https://gowers.wordpress.com/",
    "A recent experience with AI Automation",
    "https://terrytao.wordpress.com/",
    "The story of AI Integration",
    "Jacobian challenge"
  ],
  "textContent": "EricGT:\n\n> Without more specifics, such as the area of mathematics, the level of mathematics, and the intended goal — for example, solving problems, proving results, formalizing proofs, or assisting with exploration — this is difficult to answer in a general way.\n>\n> Even with those details, I would be cautious about treating any single benchmark as measuring “mathematical reasoning” in the broad sense. Benchmarks can be useful, but they usually measure performance on a particular kind of task under a particular setup.\n>\n> * * *\n>\n> bar1s:\n>\n>> Does anyone have links to … personal experience using OpenAI models for math-heavy workloads?\n>\n> Two mathematicians who have occasionally written about using AI are:\n>\n>   * Timothy Gowers\n>\n\n>\n> https://gowers.wordpress.com/\n>\n> **A recent experience with AI Automation**\n>\n>   * Terence Tao\n>\n\n>\n> https://terrytao.wordpress.com/ The story of AI Integration\n>\n> * * *\n>\n> For something very recent and closer to the Lean/formalization side, see:\n>\n> **Jacobian challenge**\n>\n> That thread may be especially relevant if the question is not just about solving math problems, but about AI-assisted formalization or proof-related workflows.\n\nRegarding my personal experience with math workloads, the answer is putting together the LLM with something like Lean 4 or even just Python’s SymPy via Code Interpreter. I think standard LLMs fail at raw calculation",
  "title": "Are There Any Good Benchmarks Comparing OpenAI API Models?"
}