Made a Python failure dataset for DPO/RLHF — how do you source negative examples?
Hi everyone,
I’ve been quietly building a Python failure dataset for DPO / RLHF training over the past couple of weeks, with the pipeline running 24/7 on a single RTX 4060.
The basic idea: an autopilot pipeline generates Python code attempts for various CS domains (FFT, Monte Carlo, ZKP, etc.), runs each in a sandboxed pytest container, and keeps the genuine failures, together with their error logs, as rejected-side training data.
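To make the "run, then keep the failures" step concrete, here is a minimal sketch of how it could look. It is not the actual pipeline: it uses a plain subprocess in a temp directory instead of a sandboxed container, and `RunResult`, `run_attempt`, `to_rejected_row`, and the row field names are all hypothetical. It also assumes "genuine failure" means pytest exit code 1 (tests ran and at least one failed); timeouts, collection errors, and other exit codes are skipped as possible infrastructure noise.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# pytest's documented exit code for "tests were collected and run, some failed".
PYTEST_TESTS_FAILED = 1


@dataclass
class RunResult:
    returncode: int
    log: str


def run_attempt(code: str, tests: str, timeout_s: float = 30.0) -> RunResult:
    """Run one generated attempt against its tests in a throwaway directory.

    NOTE: a subprocess is NOT a sandbox; a real pipeline would run this
    inside a container, as described in the post.
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "attempt.py").write_text(code)
        Path(tmp, "test_attempt.py").write_text(tests)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_attempt.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Hypothetical convention: -1 marks a timed-out run.
            return RunResult(returncode=-1, log="timeout")
        return RunResult(returncode=proc.returncode, log=proc.stdout + proc.stderr)


def to_rejected_row(prompt: str, code: str, result: RunResult) -> Optional[dict]:
    """Return a rejected-side row for a genuine test failure, else None."""
    if result.returncode != PYTEST_TESTS_FAILED:
        return None  # passed, timed out, or failed for non-test reasons
    return {
        "prompt": prompt,          # hypothetical field names
        "rejected": code,
        "error_log": result.log,
    }
```

Usage would be roughly `row = to_rejected_row(prompt, code, run_attempt(code, tests))`, appending `row` to the dataset whenever it is not `None`. Whether collection errors (for example, a syntax error in the attempt) should also count as genuine failures is a design choice this sketch leaves out.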
Quick stats:
- ~2K failure rows shipped (v1, v2)
- 19 CS domains covered
- 146 downloads since launch
Two questions for DPO / RLHF practitioners here:
1. How are you currently sourcing negative examples for DPO? Do you have your own pipeline, or rely on synthetic data from larger models? Curious about the trade-offs you’ve found.
2. What domains do you most need failure data for? I can pivot the autopilot’s domain priority in a few days, so concrete requests directly shape what gets generated next.
Free sample (100 rows):
namakoo/idfu-verified-code (Datasets at Hugging Face, huggingface.co)
Even one-line replies help calibrate the next release.
-- namakoo