Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih34prc72qku6ufjxxqgp4un5xykr5a5xkxu3vtd5obrpzkaxl2l4",
    "uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3monhzubnaai2"
  },
  "path": "/t/are-there-any-good-benchmarks-comparing-openai-api-models/1383961#post_5",
  "publishedAt": "2026-06-19T13:21:41.000Z",
  "site": "https://community.openai.com",
  "tags": [
    "Lean 4 Zulip forum"
  ],
  "textContent": "Mark_Mcdoe:\n\n> the answer is putting together the LLM with something like Lean 4\n\nIf you are not already a member of the Lean 4 Zulip forum, it is worth joining. It is free, and the Lean community is in a much better position to help with questions about combining LLMs with Lean 4.\n\nMark_Mcdoe:\n\n> I think standard LLMs fail at raw calculation\n\nWhat do you mean by “raw calculation” here?\n\nDo you mean arithmetic computation, symbolic manipulation, multi-step mathematical derivations, formal proof construction, or something else? Those are different tasks, and LLMs tend to fail in different ways depending on which one you mean.",
  "title": "Are There Any Good Benchmarks Comparing OpenAI API Models?"
}