{
  "$type": "site.standard.document",
  "content": {
    "$type": "site.standard.content.markdown",
    "text": "There are many interesting [funding](https://gitcoin.co/) experiments happening these days: RetroPGF, ProPGF, quadratic rounds, expert juries, ML competitions, prediction markets, and anything in between.\n\nExperimentation is great. We should to try many things!\n\nThe point I made in the past is that [we need an evaluation layer for these experiments](/weight-allocation-mechanism-evals), otherwise we’re just running them on vibes and hoping they work. We change jury setups, aggregation rules, market structures, and eligibility criteria, then... just stare at the final allocation and move on. Most of the attention has gone into designing mechanism while almost none has gone into how do we tell they even work at all.\n\n## The Missing Ground Truth\n\nValues are plural. Impact can be fuzzy. Metrics get gamed. That does **not** make evaluation impossible. In fact, current mechanisms are being evaluated one way or another! [Without explicit evaluations mechanisms get judged anyway, just through opaque social processes that reward confidence, aesthetics, and insider legitimacy over evidence](https://www.jofreeman.com/joreen/tyranny.htm).\n\n> Do these mechanisms beat simpler or cheaper alternatives on the metrics/values we care about?\n\nThere may never be one canonical benchmark, but we can still build shared, falsifiable evaluation loops.\n\n## Evaluating Mechanism\n\nA mechanism might produce a decent allocation and still be a bad fit. Mechanism can be too opaque, too expensive, too easy to game, or too hard to explain. Each round/implementation requires diferent tradeoffs. Having an evaluation layer makes these explicit so the community can take better decissions and know what are they giving away by choosing one mechanism over another.\n\n### 1. Define \"bettet\"\n\nWhat metric will be used to compare mechanisms?\n\n- Agreement with holdout judgments?\n- Retrospective impact to a set of KPIs?\n- Stability across reruns?\n- Robustness to noisy evaluators?\n- Legitimacy with participants?\n- Cost per unit of improvement?\n\nThe goal is not to find a perfect metric, but to coordinate on one and iterate. The act of discussing a metric is in itself useful!\n\n### 2. Publish a Baseline\n\nNo mechanism should be discussed without a baseline alternative to compare against. \n\n- Equal split\n- Random allocation\n- Quick expert allocation\n- Simple agent based allocation\n\nThis acts as the falsifiable hypotheses. E.g: \"this mechanism beats an expert-in-an-afternoon baseline on holdout pairwise agreement\" or \"this mechanism is more stable under reruns\"\n\n### 3. Compare Blindly \n\nDo not ask people whether they like \"the Deep Funding output\" or \"the expert allocation\". Show _allocation A_ and _allocation B_ without labels. Ask which one looks better, which one looks most wrong, and why. Apps like [PGF Arena](/pgf-arena/) can make this kind of comparison easier.\n\n### 4. Analyze Errors\n\nDo not stop at leaderboard scores. Look at where the mechanism failed:\n\n- Where it strongly disagreed with evaluators\n- Where it produced obviously weird weights\n- Where baselines beat it\n- Where results were unstable under small changes\n\nThen label the failure modes: noisy raters, confussing category, missing context, popularity bias, aggregation artifacts, gaming, overconfidence.\n\nThese evaluations should be public, reproducible, and forkable: data, scoring, rules, and outputs should be inspectable by anyone for this process to be credible neutral.\n\n## Conclusion\n\nWith a principled approach, the output of a round is not only final allocation anymore. It is also the **cumulative learnings** the scientific method enables!\n\nPublic goods funding does need experimentation. It may not need more mechanisms until we know how well the ones we already have work. There is still no widely accepted evaluation loop for comparing these mechanisms. That itself is a great public good we should strive for!",
    "version": "1.0"
  },
  "description": "There are many interesting funding experiments happening these days: RetroPGF, ProPGF, quadratic rounds, expert juries, ML competitions, prediction markets, and anything in between. Experimentation is great. We should to try many things! The point I made in the past is that we...",
  "path": "/public-goods-funding-needs-evals",
  "publishedAt": "2026-03-20T00:00:00.000Z",
  "site": "at://did:plc:4z5i7njrld66ew36htufcwry/site.standard.publication/3mo43d2tmt2ov",
  "textContent": "There are many interesting funding experiments happening these days: RetroPGF, ProPGF, quadratic rounds, expert juries, ML competitions, prediction markets, and anything in between.\n\nExperimentation is great. We should to try many things!\n\nThe point I made in the past is that we need an evaluation layer for these experiments, otherwise we’re just running them on vibes and hoping they work. We change jury setups, aggregation rules, market structures, and eligibility criteria, then... just stare at the final allocation and move on. Most of the attention has gone into designing mechanism while almost none has gone into how do we tell they even work at all.\n\nThe Missing Ground Truth\n\nValues are plural. Impact can be fuzzy. Metrics get gamed. That does not make evaluation impossible. In fact, current mechanisms are being evaluated one way or another! Without explicit evaluations mechanisms get judged anyway, just through opaque social processes that reward confidence, aesthetics, and insider legitimacy over evidence.\nDo these mechanisms beat simpler or cheaper alternatives on the metrics/values we care about?\n\nThere may never be one canonical benchmark, but we can still build shared, falsifiable evaluation loops.\n\nEvaluating Mechanism\n\nA mechanism might produce a decent allocation and still be a bad fit. Mechanism can be too opaque, too expensive, too easy to game, or too hard to explain. Each round/implementation requires diferent tradeoffs. Having an evaluation layer makes these explicit so the community can take better decissions and know what are they giving away by choosing one mechanism over another.\nDefine \"bettet\"\n\nWhat metric will be used to compare mechanisms?\nAgreement with holdout judgments?\nRetrospective impact to a set of KPIs?\nStability across reruns?\nRobustness to noisy evaluators?\nLegitimacy with participants?\nCost per unit of improvement?\n\nThe goal is not to find a perfect metric, but to coordinate on one and iterate. The act of discussing a metric is in itself useful!\nPublish a Baseline\n\nNo mechanism should be discussed without a baseline alternative to compare against. \nEqual split\nRandom allocation\nQuick expert allocation\nSimple agent based allocation\n\nThis acts as the falsifiable hypotheses. E.g: \"this mechanism beats an expert-in-an-afternoon baseline on holdout pairwise agreement\" or \"this mechanism is more stable under reruns\"\nCompare Blindly \n\nDo not ask people whether they like \"the Deep Funding output\" or \"the expert allocation\". Show allocation A and allocation B without labels. Ask which one looks better, which one looks most wrong, and why. Apps like PGF Arena can make this kind of comparison easier.\nAnalyze Errors\n\nDo not stop at leaderboard scores. Look at where the mechanism failed:\nWhere it strongly disagreed with evaluators\nWhere it produced obviously weird weights\nWhere baselines beat it\nWhere results were unstable under small changes\n\nThen label the failure modes: noisy raters, confussing category, missing context, popularity bias, aggregation artifacts, gaming, overconfidence.\n\nThese evaluations should be public, reproducible, and forkable: data, scoring, rules, and outputs should be inspectable by anyone for this process to be credible neutral.\n\nConclusion\n\nWith a principled approach, the output of a round is not only final allocation anymore. It is also the cumulative learnings the scientific method enables!\n\nPublic goods funding does need experimentation. It may not need more mechanisms until we know how well the ones we already have work. There is still no widely accepted evaluation loop for comparing these mechanisms. That itself is a great public good we should strive for!",
  "title": "Public Goods Funding Needs Evals"
}