Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigrv26e7wffilr7wjvy3o4dbv5eeg7r6b6emplze4ztvwjh6jmv6u",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkf3dkb3u652"
  },
  "path": "/t/made-a-python-failure-dataset-for-dpo-rlhf-how-do-you-source-negative-examples/175567#post_1",
  "publishedAt": "2026-04-26T07:08:55.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "huggingface.co",
    "namakoo/idfu-verified-code · Datasets at Hugging Face"
  ],
  "textContent": "Hi everyone,\n\nI’ve been quietly building a Python failure dataset for DPO / RLHF\ntraining over the past couple of weeks, running 24/7 on a single\nRTX 4060.\n\nThe basic idea: an autopilot pipeline generates Python code attempts\nfor various CS domains (FFT, Monte Carlo, ZKP, etc.), runs each in a\nsandboxed pytest container, and keeps the genuine failures with\nerror logs as `rejected`-side training data.\n\nQuick stats:\n\n  * ~2K failure rows shipped (v1, v2)\n  * 19 CS domains covered\n  * 146 downloads since launch\n\n\n\nTwo questions for DPO / RLHF practitioners here:\n\n**1. How are you currently sourcing negative examples for DPO?**\nDo you have your own pipeline, or rely on synthetic data from larger\nmodels? Curious about the trade-offs you’ve found.\n\n**2. What domains do you most need failure data for?**\nI can pivot the autopilot’s domain priority in a few days, so\nconcrete requests directly shape what gets generated next.\n\nFree sample (100 rows):\n\nhuggingface.co\n\n### namakoo/idfu-verified-code · Datasets at Hugging Face\n\nWe’re on a journey to advance and democratize artificial intelligence through open source and open science.\n\nEven one-line replies help calibrate the next release.\n\n-– namakoo",
  "title": "Made a Python failure dataset for DPO/RLHF — how do you source negative examples?"
}