Raw Record Source

{
  "$type": "site.standard.document",
  "content": {
    "$type": "site.standard.content.markdown",
    "text": "Open datasets are everywhere. Maintained datasets are rare.\n\nI keep seeing the same pattern in [open data ecosystems](/modern-open-data-portals). A few folks do expensive [curation work](/community-level-open-data-infrastructure), the rest of us [free-ride](/handbook/public-goods-funding/), and eventually the dataset goes stale because data wrangling is time consuming, tedious, and technically demanding. [Spending time curating and maintaining datasets for other people to use doesn't make economic sense, unless you can profit from that](https://davidgasquez.com/handbook/open-data).\n\nThis post is about a simple question, and a potential solution. The question is: **Can we design a credibly neutral way to [incentivize](/handbook/incentives/) and elicit useful datasets for tasks with benchmarks?** The solution I came up with is a mechanism I call \"Tributary\". Let's dive in.\n\n## Mechanism\n\nTributary is a PoC [mechanism](/handbook/mechanism-design/) that works like a flipped [open source Kaggle-ish competition](/steering-ais). That is:\n\n- Design a benchmarked task with a **hidden test set**.\n- Keep the **model fixed** (e.g., Random Forest).\n- Let participants submit **data**, not models.\n- Reward contributors based on how much their data improves benchmark performance.\n\nSo instead of \"who trained the best model\", the question is **\"whose data improved the task you care about the most\"**.\n\nHaving a fixed model and scoring criteria makes the evaluation objective and reproducible. It works like a benchmark for the most useful data.\n\nThis framing is close to [DataPerf Training Set Acquisition](https://www.dataperf.org/training-set-acquisition/acquisition-overview), where the core problem is deciding what data to buy under constraints, then scoring quality on held-out evaluation.\n\n## Credible Neutrality\n\nIn this mechanism, [credible neutrality](https://balajis.com/p/credible-neutrality) maps to:\n\n1. 
**No hand-picked winners** in the rules. If the data improves the score, it gets rewarded.\n2. [**Open code and verifiable execution**](/credible-neutral-ai-competitions). Anyone can check the rules and verify the results.\n3. **Simple mechanism** before fancy economics.\n4. **Slow-changing rules** so people can trust the game.\n\nIn practice, this pushes the design toward plain and simple git repositories, public scripts, deterministic evaluation, and auditable artifacts.\n\n## Tributary\n\nI built a minimal prototype, [`tributary`](https://github.com/davidgasquez/tributary), that implements the above mechanism. It has:\n\n- A `public.asc` for encrypting participants' dataset submissions\n- An encrypted test set (`data/test.csv.asc`)\n- A fixed model (`model.py`)\n- A script to evaluate submissions (`evaluate.py`)\n- A registry for submissions (`submissions.yaml`)\n\n### Workflow\n\nSay you want to contribute a dataset. You play the game by:\n\n1. Encrypting the dataset you want the model trained on, using the public key.\n2. Opening a PR adding a URL to `submissions.yaml`. Ideally you point to a [CID](/handbook/ipfs/) (immutable hash of the content).\n3. Waiting for the PR to be merged, after which `tributary` downloads and decrypts your dataset, trains the fixed model, computes the score on the hidden test set, and updates the leaderboard based on the **marginal contribution** of your dataset.\n\nSince we cannot directly verify every row, the mechanism pays for information that improves predictive power or agreement structure. We can use [Shapley values](https://christophm.github.io/interpretable-ml-book/shapley.html) or similar techniques to derive the marginal contribution of each row in the dataset.\n\n## Conclusion\n\nI don't think this mechanism is perfect, and it definitely needs more work, but I do think it is a credibly neutral path to test a narrower claim. 
That is, **given a benchmarked task, can we reward the creation of useful data directly, in public, with rules everyone can audit?**\n\nHopefully, something like this can be used in the future to start rewarding the hard work of curating open datasets, and to start building a culture of dataset maintenance and stewardship.\n\nFinally, here are some extra resources on [peer prediction and information elicitation work](/handbook/mechanism-design/) you might find useful.\n\n- [Truthful Data Acquisition via Peer Prediction (NeurIPS 2020)](https://proceedings.neurips.cc/paper/2020/file/d35b05a832e2bb91f110d54e34e2da79-Paper.pdf)\n- [Peer Truth Serum (2017)](https://arxiv.org/pdf/1704.05269)\n- [A Market Framework for Eliciting Private Data (NeurIPS 2015)](https://proceedings.neurips.cc/paper/2015/file/7af6266cc52234b5aa339b16695f7fc4-Paper.pdf)",
    "version": "1.0"
  },
  "description": "Open datasets are everywhere. Maintained datasets are rare. I keep seeing the same pattern in open data ecosystems. A few folks do expensive curation work, the rest of us free-ride, and eventually the dataset goes stale because data wrangling is time consuming, tedious, and te...",
  "path": "/tributary-datasets",
  "publishedAt": "2026-02-11T00:00:00.000Z",
  "site": "at://did:plc:4z5i7njrld66ew36htufcwry/site.standard.publication/3mo43d2tmt2ov",
  "textContent": "Open datasets are everywhere. Maintained datasets are rare.\n\nI keep seeing the same pattern in open data ecosystems. A few folks do expensive curation work, the rest of us free-ride, and eventually the dataset goes stale because data wrangling is time consuming, tedious, and technically demanding. Spending time curating and maintaining datasets for other people to use doesn't make economic sense, unless you can profit from that.\n\nThis post is about a simple question, and a potential solution. The question is: Can we design a credibly neutral way to incentivize and elicit useful datasets for tasks with benchmarks? The solution I came up with is a mechanism I call \"Tributary\". Let's dive in.\n\nMechanism\n\nTributary is a PoC mechanism that works like a flipped open source Kaggle-ish competition. That is:\nDesign a benchmarked task with a hidden test set.\nKeep the model fixed (e.g., Random Forest).\nLet participants submit data, not models.\nReward contributors based on how much their data improves benchmark performance.\n\nSo instead of \"who trained the best model\", the question is \"whose data improved the task you care about the most\".\n\nHaving a fixed model and scoring criteria makes the evaluation objective and reproducible. It works like a benchmark for the most useful data.\n\nThis framing is close to DataPerf Training Set Acquisition, where the core problem is deciding what data to buy under constraints, then scoring quality on held-out evaluation.\n\nCredible Neutrality\n\nIn this mechanism, credible neutrality maps to:\nNo hand-picked winners in the rules. If the data improves the score, it gets rewarded.\nOpen code and verifiable execution. 
Anyone can check the rules and verify the results.\nSimple mechanism before fancy economics.\nSlow-changing rules so people can trust the game.\n\nIn practice, this pushes the design toward plain and simple git repositories, public scripts, deterministic evaluation, and auditable artifacts.\n\nTributary\n\nI built a minimal prototype, tributary, that implements the above mechanism. It has:\nA public.asc for encrypting participants' dataset submissions\nAn encrypted test set (data/test.csv.asc)\nA fixed model (model.py)\nA script to evaluate submissions (evaluate.py)\nA registry for submissions (submissions.yaml)\n\nWorkflow\n\nSay you want to contribute a dataset. You play the game by:\nEncrypting the dataset you want the model trained on, using the public key.\nOpening a PR adding a URL to submissions.yaml. Ideally you point to a CID (immutable hash of the content).\nWaiting for the PR to be merged, after which tributary downloads and decrypts your dataset, trains the fixed model, computes the score on the hidden test set, and updates the leaderboard based on the marginal contribution of your dataset.\n\nSince we cannot directly verify every row, the mechanism pays for information that improves predictive power or agreement structure. We can use Shapley values or similar techniques to derive the marginal contribution of each row in the dataset.\n\nConclusion\n\nI don't think this mechanism is perfect, and it definitely needs more work, but I do think it is a credibly neutral path to test a narrower claim. 
That is, given a benchmarked task, can we reward the creation of useful data directly, in public, with rules everyone can audit?\n\nHopefully, something like this can be used in the future to start rewarding the hard work of curating open datasets, and to start building a culture of dataset maintenance and stewardship.\n\nFinally, here are some extra resources on peer prediction and information elicitation work you might find useful.\nTruthful Data Acquisition via Peer Prediction (NeurIPS 2020)\nPeer Truth Serum (2017)\nA Market Framework for Eliciting Private Data (NeurIPS 2015)",
  "title": "Eliciting useful datasets"
}