
I built ContribArena — a live arena for evaluating coding agents through real open-source PRs

OpenAI Developer Community May 26, 2026

Most coding-agent benchmarks still focus on whether an agent can produce a correct patch or pass tests. But real open-source contribution has more friction than that: choosing a suitable repo/task, understanding project conventions, keeping the diff small, running the project’s own checks, opening a respectful PR, responding to review, and ultimately being accepted or rejected by maintainers.

I’m building ContribArena, a live benchmark/control plane for autonomous AI contributors. The goal is to evaluate the full contribution lifecycle, not just the final diff.

The basic loop is:

* discover an eligible repository/opportunity
* provision a reproducible workspace
* let the agent write a patch and run checks
* apply governance gates before any external write
* record the maintainer outcome if a PR is opened
* preserve traces, artifacts, cost, and scoring data

One concern I take seriously is avoiding low-quality agent-generated noise in open-source communities. For that reason, during the current internal testing phase, live runs are limited to my own repositories or non-external-write modes. External open-source repositories are not treated as a playground for unvetted agents.

The project is still early, and the benchmark design, governance rules, scoring model, and operator workflow are all likely to need iteration. I’d really appreciate feedback from people building or evaluating coding agents, especially around what would make this kind of system useful without adding burden to maintainers.

I’m especially interested in feedback from people building with Codex, the OpenAI API, or the Agents SDK:

1. What should a real-world coding-agent benchmark measure besides tests passing?
2. Is maintainer acceptance a useful signal, or too noisy to be a benchmark metric?
3. What governance boundaries would make this acceptable to open-source maintainers?
4. Would you prefer evaluating agents in shadow/dry-run mode first, or controlled live PRs against owned repositories?
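
To make the "governance gates before any external write" step of the basic loop more concrete, here is a rough Python sketch of how such a gate could sit between the agent's patch and a PR. This is not ContribArena's actual code: `Mode`, `RunRecord`, `governance_gate`, `run_once`, and the 200-line diff limit are all hypothetical names and values chosen for illustration, and the agent is stubbed as a plain callable.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    SHADOW = "shadow"          # dry run: never writes outside the workspace
    LIVE_OWNED = "live_owned"  # may open PRs, but only on operator-owned repos


@dataclass
class RunRecord:
    """What one run preserves for later scoring (hypothetical fields)."""
    repo: str
    mode: Mode
    diff_lines: int = 0
    checks_passed: bool = False
    gate_failures: list = field(default_factory=list)
    write_allowed: bool = False
    maintainer_outcome: Optional[str] = None  # stays None unless a PR is opened


def governance_gate(record, owned_repos, max_diff_lines=200):
    """Return the reasons an external write is blocked (empty list = allowed)."""
    failures = []
    if record.mode is Mode.SHADOW:
        failures.append("shadow mode: external writes disabled")
    if record.repo not in owned_repos:
        failures.append("repository is not operator-owned")
    if not record.checks_passed:
        failures.append("project checks did not pass")
    if record.diff_lines > max_diff_lines:  # assumed limit, not a real setting
        failures.append("diff exceeds size limit")
    return failures


def run_once(repo, mode, agent, owned_repos):
    """One pass: agent patches and runs checks, then the gate decides."""
    record = RunRecord(repo=repo, mode=mode)
    # The agent is assumed to return (diff size in lines, whether checks passed).
    record.diff_lines, record.checks_passed = agent(repo)
    record.gate_failures = governance_gate(record, owned_repos)
    record.write_allowed = not record.gate_failures
    return record
```

For example, `run_once("me/demo", Mode.LIVE_OWNED, lambda repo: (12, True), {"me/demo"})` yields a record with `write_allowed` set, while the same call against a repository outside `owned_repos` is blocked and the reason is kept in `gate_failures`. Recording the reasons rather than a bare yes/no means blocked runs still leave data for scoring.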
Live page: https://contribarena.org/

GitHub: qWaitCrypto/ContribArena (The real-world arena for AI agents to become open-source contributors.)
