Raw Record Source

{
  "$type": "site.standard.document",
  "content": {
    "$type": "site.standard.content.markdown",
    "text": "> **TL;DR:** This post explores using a multi-agent systems for ranking tasks. Check out [Arbitron](https://github.com/davidgasquez/arbitron) if you want to see a working implementation of the pattern/ideas.\n\n\nOne of the latest [Kaggle style competitions](https://github.com/deepfunding/) I've [been participating in](/steering-ais) got me thinking about the difficulties involved in collecting accurate and relevant preferences from humans and aggregating them in somewhat consistent rankings or weight distributions.\n\nI did some research around this general issue and, at the same time, [worked on a small tool to explore a potential approach for the competition](https://x.com/davidgasquez/status/1941525990024544418).\n\n> **Can a bunch of LLM Agents be used to rank an arbitrary set of items in a consistent way?**\n\nA couple of weeks later, I had the chance to attend the [Impact Evaluator Research Retreat](https://www.researchretreat.org/ierr-2025/) and, in the first few days, realized the idea was a perfect residency project.\n\nI had the opportunity to explore this idea further and this post explores the main learnings!\n\n<figure style=\"margin: 1em 0;\">\n  <img src=\"https://www.developerweek.com/wp-content/uploads/2024/12/DeveloperWeek-2025-Hackathon_featured.jpg\" alt=\"contest\" style=\"width: 100%; height: auto;\" />\n  <figcaption style=\"text-align: center; font-style: italic; color: #666; margin-top: 0.0em; font-size: 0.9em;\">Ranking participants in large hackathons is no joke!</figcaption>\n</figure>\n\n## Problem\n\nThis is the general version of the problem.\n\n> **Given an evaluation criteria and an arbitrary set of items, how can we produce the highest-quality judging results?**\n\nIt's a very common problem as you can imagine. 
You'll encounter it when jurors have to evaluate submissions in large hackathons or humans have to rank LLM responses based on \"usefulness\".\n\nThe naive (and _unfortunately_ the most common) approach is to ask humans to rate each item using a [Likert scale](https://en.wikipedia.org/wiki/Likert_scale) or similar. This has several issues:\n\n- Every juror has a different scale and interpretation of the scale. Your 6 might be my 3.\n- Without knowing the entire population, the first ratings are arbitrary as jurors don't have a global view.\n- Humans excel at relative judgments, but struggle with absolute judgments. Is this a 7.3 or an 8.6?\n- Also, humans are not very consistent. An 8 now could have been a 7 this morning.\n\nAs many people have realized long before me, [there are better ways](https://anishathalye.com/designing-a-better-judging-system/) to rank items (e.g. [chocolate](https://medium.com/@florian_32814/ten-kilograms-of-chocolate-75c4fa3492b6)). Let's look at one interesting approach.\n\n## Simplify Decisions with Pairwise Comparisons\n\nThis is probably one of the simplest solutions. **Evaluate or rank the items by making jurors do pairwise comparisons between random items**. This helps in several ways:\n\n- Avoid the issues of absolute ratings. Is A better than B? Yes or no?\n- Reduce the cognitive load on jurors. They only need to compare two items at a time.\n- Allow jurors to focus on the differences between items. Which one is better on the property X?\n- Avoid some of the biases of absolute ratings.\n- Capture qualitative information better.\n- Make it easier to aggregate the results. No need to normalize or standardize ratings across jurors. Once you have the pairwise comparisons, you can use multiple algorithms to derive a ranking or weight distribution.\n- Reduce the impact of outliers.\n\nQuite cool! 
[Pairwise preferences](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3359677) have been explored in the literature, but still feel underused in practice.\n\n## In 2025 you have to use LLMs\n\nSo, LLMs can surely help with the judging/ranking process, right? Well, they can, but there are some challenges if you approach the problem naively (using one LLM to rank all items).\n\n- Loading all candidates/items in one long context will confuse the model.\n- Running the same prompt twice will give you inconsistent results.\n- Any prompt engineering you do will affect the results.\n- The model is exposed to prompt injection.\n\nAnother approach is to have multiple agents and allow them to collaborate and talk to each other. I don't have any data to back this up, but I think they'll probably lose track of the context and it would be more expensive. The first opinion would also be very influential!\n\nWe've learned a better way though! What if we could have **multiple LLMs (agents) that are specialized in evaluating items based on different criteria**? Each agent could focus on a specific aspect of the items, and then we could aggregate their results.\n\nThat is basically what I set out to explore with [Arbitron](https://github.com/davidgasquez/arbitron) after realizing that using standalone LLMs with long context wasn't ideal for the competition I was working on.\n\n## Arbitron\n\n[Arbitron](https://github.com/davidgasquez/arbitron) is _\"a multi-agent consensus ranking system to derive optimal weights through pairwise comparisons\"_. Sounds [more complex than it is](https://github.com/davidgasquez/arbitron/blob/main/examples/simple.py). Think of it as a framework to define agents (LLMs in a loop with tools) that evaluate items based on different criteria. 
The results then get aggregated to produce a final ranking or weight distribution.\n\n```python\nimport arbitron\n\nmovies = [\n    arbitron.Item(id=\"arrival\"),\n    arbitron.Item(id=\"blade_runner\"),\n    arbitron.Item(id=\"interstellar\"),\n    arbitron.Item(id=\"inception\"),\n    arbitron.Item(id=\"the_dark_knight\"),\n    arbitron.Item(id=\"dune\"),\n    arbitron.Item(id=\"the_matrix\"),\n    arbitron.Item(id=\"2001_space_odyssey\"),\n    arbitron.Item(id=\"the_fifth_element\"),\n    arbitron.Item(id=\"the_martian\"),\n]\n\nagents = [\n    arbitron.Agent(\n        id=\"SciFi Purist\",\n        prompt=\"Compare based on scientific accuracy and hard sci-fi concepts.\",\n        model=\"google-gla:gemini-2.5-flash\",\n    ),\n    arbitron.Agent(\n        id=\"Nolan Fan\",\n        prompt=\"Compare based on complex narratives and emotional depth.\",\n        model=\"groq:qwen/qwen3-32b\",\n    ),\n    arbitron.Agent(\n        id=\"Critics Choice\",\n        prompt=\"Compare based on artistic merit and cinematic excellence.\",\n        model=\"openai:gpt-4.1-nano\",\n    ),\n]\n\ndescription = \"Rank the movies based on their soundtrack quality.\"\n\ncomparisons = arbitron.run(description, agents, movies)\nranking = arbitron.rank(comparisons)\n```\n\nThe previous code will give you a ranking of the movies based on the criteria defined in the `description`! It uses [PydanticAI](https://ai.pydantic.dev/) for the LLM things and [choix](https://github.com/lucasmaystre/choix/) for the ranking algorithms. Some of my favorite features of Arbitron are:\n\n- Supports arbitrary comparisons (text, code, pictures) thanks to multimodal LLMs.\n- Customizable agents with unique personas, tools, providers.\n- Ranking algorithm independent. 
Use whichever algorithm you prefer (Elo, Bradley-Terry, etc.).\n- [Wisdom of the crowds stability](https://x.com/hwchase17/status/1796269356625875049) and some slight bias reduction as you mix providers and LLMs.\n  - The key advantage is reducing single-point bias while maintaining explainability through agent reasoning traces. It's particularly powerful where \"correctness\" is multifaceted and subjective consensus adds value.\n- Access the raw data. Mix it with human comparisons, see the reasoning of why item D lost against item A, ...\n  - Reasoning / interpretability gives transparency.\n- Cheap and embarrassingly parallelizable.\n  - Use cheaper models while preserving quality.\n  - Comparing is cheaper and easier than writing (output tokens).\n  - Run across providers, machines, ...\n\n### Evaluations\n\nI've done a [couple of local experiments](https://github.com/davidgasquez/arbitron/tree/main/src/arbitron/evals) to evaluate the performance of Arbitron. I compared the [ranking accuracy (Kendall Tau)](https://github.com/davidgasquez/arbitron/blob/main/src/arbitron/evals/scoring.py) of different systems:\n\n- Arbitron with 3 small agents (a.k.a. Arbitrinity [^1])\n- Arbitron with 10 small agents (a.k.a. Arbiten)\n- Gemini 2.5 Flash and Pro\n- Claude Sonnet 4\n- OpenAI o3-high\n- ChatGPT (4o)\n- Many smaller models\n- Arbitron with 3 frontier agents (a.k.a. Arbitrinity Max)\n- Arbitron with 10 frontier agents (a.k.a. Arbiten Max)\n\nThe first eval is to make agents choose [which movie was released earlier](https://github.com/davidgasquez/arbitron/blob/main/src/arbitron/evals/movies.py). This should produce a sorted list of movies.\n\nIn this simple example, most of the models got things right! The only ones that didn't were ChatGPT, GPT 4.1 and the small models. This type of knowledge is easily available inside the LLMs, so they had little trouble retrieving it.\n\nThe second eval is trickier. 
The goal is to [rank Wikipedia articles based on their popularity](https://github.com/davidgasquez/arbitron/blob/main/src/arbitron/evals/wiki.py) (cumulative number of page views since 2007). Now things get interesting as [this data is not common in their corpus](https://en.wikipedia.org/wiki/Wikipedia:Popular_pages).\n\n> 🎮 [Check how you score in the same benchmark](https://davidgasquez.com/experiments/wikigame/)\n\nHere are the scores of the latest run. The higher the Kendall Tau score, the better the ranking.\n\n| Model            | Kendall Tau Score |\n| :--------------- | :---------------- |\n| Arbitrinity      | 0.2               |\n| Arbiten          | 0.15              |\n| Gemini 2.5 Pro   | -0.24             |\n| Gemini 2.5 Flash | -0.33             |\n| Opus             | 0.28              |\n| GPT 4.1          | -0.06             |\n| GPT o3           | 0.33              |\n| Arbitrinity Max  | 0.16              |\n| Arbiten Max      | 0.12              |\n\nEven in this simple example with only 10 items (not taking a lot of context), Arbitron usually outperformed single models, except for Opus and o3. Interestingly, having 10 agents didn't seem to improve the results!\n\nAnecdotally, the scores from Arbitron also seemed more consistent across runs ([others have noticed this previously](https://x.com/hwchase17/status/1796269356625875049)). More research is definitely needed!\n\nA few interesting questions are still worth exploring:\n\n- How does it compare to one human?\n- How does it compare to a bunch of humans?\n- Can we improve human accuracy by adding \"agent\" comparisons to the mix?\n\n## Learnings\n\nThe biggest learning for me has been a better intuition for why pairwise comparisons work so well in this context. I've also learned a lot about the state of the art in using pairwise comparisons when training and evaluating LLMs (a common [approach since 2017](https://arxiv.org/abs/1706.03741)). 
There is a lot of literature around [aligning LLMs with human judgement using pairwise comparisons](https://arxiv.org/abs/2403.16950) that I wasn't aware of before!\n\nMany of these ideas are the basis of [how RLHF works](https://arxiv.org/html/2505.11864v1#S3) these days. Modern RLHF practices (e.g. [pairwise reranker](https://www.zeroentropy.dev/articles/improving-retrieval-with-elo-scores)) use preference data rather than absolute scores, thanks to the advantages of pairwise comparisons described earlier. Chatbot Arena (which ranks all major LLMs) is entirely based on pairwise comparisons. People [building LLMs are relying on this](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/).\n\nAnother important realization is how much interfaces and UX matter, both for humans and LLMs. While doing the evaluations, I could feel how much the prompt design affected results! E.g., there is a strong verbosity bias where longer responses often win.\n\nFinally, this approach is not a silver bullet. Every context will have different requirements, and while Arbitron-style systems typically get the direction right, the magnitudes/weights they come up with may be noisy.\n\n## Uses\n\nThis approach shines where the evaluation involves some subjectivity that is hard to measure (in cases where an objective answer exists, the agents could use a tool and they'll get it). Here are some areas that come to mind:\n\n- Any kind of [Impact Evaluator](https://davidgasquez.com/handbook/impact-evaluators/) that has to attribute weights to every claim.\n- Filtering items based on some criteria. E.g., grant proposals, hackathon projects, ...\n- Coming up with a custom distribution function on top of the weights. E.g., force or avoid a peanut butter spread.\n- Processes where prompt injection is common. Multiple agents from different providers make it much harder to game the system.\n- Places where some plurality of opinions is needed. 
Different agents can represent different values/stakeholders in the evaluation.\n\n## Future\n\nOf course, I have a long list of things I'd like to continue exploring. There are many obvious improvements to the current tool like making it a web app, but also more interesting research questions:\n\n- Is it better to have one contest or many contests with different values?\n- How can it be made more neutral and transparent?\n- How resistant is it to adversarial attacks?\n- What if each juror had a \"custom research agent\" it could trigger to dig into the question without affecting its own context?\n- Inspired by DSPy, what if each comparison had a [paraphrasing of the goal description instead of the description itself](https://arxiv.org/abs/2406.11370)?\n- Is it better to allow ties where both options may be equally good?\n- What is the impact of using a different algorithm like PageRank or TrueSkill?\n\nOverall, it was a very fun project and I'm very happy with the results (not so much with the costs [^2]).\n\n<div style=\"display: flex; gap: 10px;\">\n<img src=\"/images/ranking-agents-claude-max.png\" alt=\"Claude\" style=\"max-width: 350px; max-height: 350px;\" />\n<img src=\"/images/ranking-agents-gemini.png\" alt=\"Gemini\" style=\"max-width: 500px; max-height: 350px;\" />\n</div>\n\nBefore wrapping up, I want to leave you with a meta reflection. The name Arbitron was [decided by the tool itself](https://x.com/davidgasquez/status/1942164800487788579/photo/1) after I asked it to rank a bunch of names. I later realized I don't like the name Arbitron. 
The meta-lesson here is that **sometimes the more important thing is not better mechanisms for the final rank, but better mechanisms for discussing and coordinating what to propose in the first place**.\n\n## Acknowledgements\n\n- [DeepGov](https://www.deepgov.org/) and their use of AI for Democratic Capital Allocation and Governance.\n- [Daniel Kronovet](https://kronosapiens.github.io/) for his many writings on the power of pairwise comparisons.\n- [Deep Funding Competition](https://github.com/deepfunding/)\n\n[^1]: Naming is not my strong suit.\n[^2]: Always set a spending limit on your LLM providers!",
    "version": "1.0"
  },
  "description": "TL;DR: This post explores using a multi-agent systems for ranking tasks. Check out Arbitron if you want to see a working implementation of the pattern/ideas. One of the latest Kaggle style competitions I've been participating in got me thinking about the difficulties involved...",
  "path": "/ranking-with-agents",
  "publishedAt": "2025-08-06T00:00:00.000Z",
  "site": "at://did:plc:4z5i7njrld66ew36htufcwry/site.standard.publication/3mo43d2tmt2ov",
  "textContent": "TL;DR: This post explores using a multi-agent systems for ranking tasks. Check out Arbitron if you want to see a working implementation of the pattern/ideas.\n\nOne of the latest Kaggle style competitions I've been participating in got me thinking about the difficulties involved in collecting accurate and relevant preferences from humans and aggregating them in somewhat consistent rankings or weight distributions.\n\nI did some research around this general issue and, at the same time, worked on a small tool to explore a potential approach for the competition.\nCan a bunch of LLM Agents be used to rank an arbitrary set of items in a consistent way?\n\nA couple of weeks later, I had the chance to attend the Impact Evaluator Research Retreat and, in the first few days, realized the idea was a perfect residency project.\n\nI had the opportunity to explore this idea further and this post explores the main learnings!\n\n  \n  Ranking participants in large hackathons is no joke!\n\nProblem\n\nThis is the general version of the problem.\nGiven an evaluation criteria and an arbitrary set of items, how can we produce the highest-quality judging results?\n\nIt's a very common problem as you can imagine. You'll encounter it when jurors have to evaluate submissions in large hackathons or humans have to rank LLM responses based on \"usefulness\".\n\nThe naive (and unfortunately the most common) approach is to ask humans to rate each item using a Likert scale or similar. This has several issues:\nEvery juror has a different scale and interpretation of the scale. Your 6 might be my 3.\nWithout knowing the entire population, the first ratings are arbitrary as jurors don't have a global view.\nHumans excel at relative judgments, but struggle with absolute judgments. Is this a 7.3 or an 8.6?\nAlso, humans are not very consistent. An 8 now could have been a 7 this morning.\n\nAs many people have realized long before me, there are better ways to rank items (e.g. 
chocolate). Let's look at one interesting approach.\n\nSimplify Decisions with Pairwise Comparisons\n\nThis is probably one of the simplest solutions. Evaluate or rank the items by making jurors do pairwise comparisons between random items. This helps in several ways:\nAvoid the issues of absolute ratings. Is A better than B? Yes or no?\nReduce the cognitive load on jurors. They only need to compare two items at a time.\nAllow jurors to focus on the differences between items. Which one is better on the property X?\nAvoid some of the biases of absolute ratings.\nCapture qualitative information better.\nMake it easier to aggregate the results. No need to normalize or standardize ratings across jurors. Once you have the pairwise comparisons, you can use multiple algorithms to derive a ranking or weight distribution.\nReduce the impact of outliers.\n\nQuite cool! Pairwise preferences have been explored in the literature, but still feel underused in practice.\n\nIn 2025 you have to use LLMs\n\nSo, LLMs can surely help with the judging/ranking process, right? Well, they can, but there are some challenges if you approach the problem naively (using one LLM to rank all items).\nLoading all candidates/items in one long context will confuse the model.\nRunning the same prompt twice will give you inconsistent results.\nAny prompt engineering you do will affect the results.\nThe model is exposed to prompt injection.\n\nAnother approach is to have multiple agents and allow them to collaborate and talk to each other. I don't have any data to back this up, but I think they'll probably lose track of the context and it would be more expensive. The first opinion would also be very influential!\n\nWe've learned a better way though! What if we could have multiple LLMs (agents) that are specialized in evaluating items based on different criteria? 
Each agent could focus on a specific aspect of the items, and then we could aggregate their results.\n\nThat is basically what I set out to explore with Arbitron after realizing that using standalone LLMs with long context wasn't ideal for the competition I was working on.\n\nArbitron\n\nArbitron is \"a multi-agent consensus ranking system to derive optimal weights through pairwise comparisons\". Sounds more complex than it is. Think of it as a framework to define agents (LLMs in a loop with tools) that evaluate items based on different criteria. The results then get aggregated to produce a final ranking or weight distribution.\n\nThe previous code will give you a ranking of the movies based on the criteria defined in the description! It uses PydanticAI for the LLM things and choix for the ranking algorithms. Some of my favorite features of Arbitron are:\nSupports arbitrary comparisons (text, code, pictures) thanks to multimodal LLMs.\nCustomizable agents with unique personas, tools, providers.\nRanking algorithm independent. Use whichever algorithm you prefer (Elo, Bradley-Terry, etc.).\nWisdom of the crowds stability and some slight bias reduction as you mix providers and LLMs.\nThe key advantage is reducing single-point bias while maintaining explainability through agent reasoning traces. It's particularly powerful where \"correctness\" is multifaceted and subjective consensus adds value.\nAccess the raw data. Mix it with human comparisons, see the reasoning of why item D lost against item A, ...\nReasoning / interpretability gives transparency.\nCheap and embarrassingly parallelizable.\nUse cheaper models while preserving quality.\nComparing is cheaper and easier than writing (output tokens).\nRun across providers, machines, ...\n\nEvaluations\n\nI've done a couple of local experiments to evaluate the performance of Arbitron. 
I compared the ranking accuracy (Kendall Tau) of different systems:\nArbitron with 3 small agents (a.k.a. Arbitrinity 1)\nArbitron with 10 small agents (a.k.a. Arbiten)\nGemini 2.5 Flash and Pro\nClaude Sonnet 4\nOpenAI o3-high\nChatGPT (4o)\nMany smaller models\nArbitron with 3 frontier agents (a.k.a. Arbitrinity Max)\nArbitron with 10 frontier agents (a.k.a. Arbiten Max)\n\nThe first eval is to make agents choose which movie was released earlier. This should produce a sorted list of movies.\n\nIn this simple example, most of the models got things right! The only ones that didn't were ChatGPT, GPT 4.1 and the small models. This type of knowledge is easily available inside the LLMs, so they had little trouble retrieving it.\n\nThe second eval is trickier. The goal is to rank Wikipedia articles based on their popularity (cumulative number of page views since 2007). Now things get interesting as this data is not common in their corpus.\n🎮 Check how you score in the same benchmark\n\nHere are the scores of the latest run. The higher the Kendall Tau score, the better the ranking.\n\n| Model            | Kendall Tau Score |\n| :--------------- | :---------------- |\n| Arbitrinity      | 0.2               |\n| Arbiten          | 0.15              |\n| Gemini 2.5 Pro   | -0.24             |\n| Gemini 2.5 Flash | -0.33             |\n| Opus             | 0.28              |\n| GPT 4.1          | -0.06             |\n| GPT o3           | 0.33              |\n| Arbitrinity Max  | 0.16              |\n| Arbiten Max      | 0.12              |\n\nEven in this simple example with only 10 items (not taking a lot of context), Arbitron usually outperformed single models, except for Opus and o3. Interestingly, having 10 agents didn't seem to improve the results!\n\nAnecdotally, the scores from Arbitron also seemed more consistent across runs (others have noticed this previously). 
More research is definitely needed!\n\nA few interesting questions are still worth exploring:\nHow does it compare to one human?\nHow does it compare to a bunch of humans?\nCan we improve human accuracy by adding \"agent\" comparisons to the mix?\n\nLearnings\n\nThe biggest learning for me has been a better intuition for why pairwise comparisons work so well in this context. I've also learned a lot about the state of the art in using pairwise comparisons when training and evaluating LLMs (a common approach since 2017). There is a lot of literature around aligning LLMs with human judgement using pairwise comparisons that I wasn't aware of before!\n\nMany of these ideas are the basis of how RLHF works these days. Modern RLHF practices (e.g. pairwise reranker) use preference data rather than absolute scores, thanks to the advantages of pairwise comparisons described earlier. Chatbot Arena (which ranks all major LLMs) is entirely based on pairwise comparisons. People building LLMs are relying on this.\n\nAnother important realization is how much interfaces and UX matter, both for humans and LLMs. While doing the evaluations, I could feel how much the prompt design affected results! E.g., there is a strong verbosity bias where longer responses often win.\n\nFinally, this approach is not a silver bullet. Every context will have different requirements, and while Arbitron-style systems typically get the direction right, the magnitudes/weights they come up with may be noisy.\n\nUses\n\nThis approach shines where the evaluation involves some subjectivity that is hard to measure (in cases where an objective answer exists, the agents could use a tool and they'll get it). Here are some areas that come to mind:\nAny kind of Impact Evaluator that has to attribute weights to every claim.\nFiltering items based on some criteria. E.g., grant proposals, hackathon projects, ...\nComing up with a custom distribution function on top of the weights. 
E.g., force or avoid a peanut butter spread.\nProcesses where prompt injection is common. Multiple agents from different providers make it much harder to game the system.\nPlaces where some plurality of opinions is needed. Different agents can represent different values/stakeholders in the evaluation.\n\nFuture\n\nOf course, I have a long list of things I'd like to continue exploring. There are many obvious improvements to the current tool like making it a web app, but also more interesting research questions:\nIs it better to have one contest or many contests with different values?\nHow can it be made more neutral and transparent?\nHow resistant is it to adversarial attacks?\nWhat if each juror had a \"custom research agent\" it could trigger to dig into the question without affecting its own context?\nInspired by DSPy, what if each comparison had a paraphrasing of the goal description instead of the description itself?\nIs it better to allow ties where both options may be equally good?\nWhat is the impact of using a different algorithm like PageRank or TrueSkill?\n\nOverall, it was a very fun project and I'm very happy with the results (not so much with the costs 2).\n\nBefore wrapping up, I want to leave you with a meta reflection. The name Arbitron was decided by the tool itself after I asked it to rank a bunch of names. I later realized I don't like the name Arbitron. The meta-lesson here is that sometimes the more important thing is not better mechanisms for the final rank, but better mechanisms for discussing and coordinating what to propose in the first place.\n\nAcknowledgements\nDeepGov and their use of AI for Democratic Capital Allocation and Governance.\nDaniel Kronovet for his many writings on the power of pairwise comparisons.\nDeep Funding Competition\n\n1: Naming is not my strong suit.\n2: Always set a spending limit on your LLM providers!",
  "title": "Ranking with Agents"
}