{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreihcoexyjfwj7yrzgwavatpioqjcrgdzi7dng45ouozgw6y2cwrenu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhetth6uxph2"
},
"path": "/t/arcus-h-open-benchmark-for-rl-behavioral-stability-under-stress-built-on-sb3/174387#post_1",
"publishedAt": "2026-03-18T20:23:18.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"https://zenodo.org/records/19075167",
"https://github.com/karimzn00/ARCUSH_1.0"
],
"textContent": "Hi HF community,\n\nI built ARCUS-H, an open evaluation harness that measures behavioral stability under stress as a complement to reward-based RL evaluation. It’s built entirely on Stable-Baselines3 and Gymnasium, so it should be immediately familiar to anyone in this community.\n\n**The core problem it solves:** Return tells you how well an agent performs in nominal conditions. It doesn’t tell you what happens when control authority is reduced, action execution is noisy, or reward feedback is corrupted. ARCUS-H standardizes stress evaluation so these comparisons are reproducible and algorithm-agnostic.\n\n**Main empirical finding:**\n\n`r = +0.14, p = 0.364 between normalized reward and collapse rate under valence inversion — no significant correlation across 9 environments and 7 algorithms (PPO, A2C, TRPO, DQN, DDPG, SAC, TD3). The highest-reward agents (SAC/TD3 on MuJoCo) collapse most severely under stress.`\n\n**What’s in the benchmark:**\n\n * 4 stress schedules: concept drift, resource constraint, trust violation, valence inversion\n\n * PRE → SHOCK → POST phase structure (40 episodes each)\n\n * Adaptive calibration from pre-phase (FPR = 2.0%, target α = 0.05)\n\n * 5 behavioral channels: competence, coherence, continuity, integrity, meaning\n\n * 9 environments, 7 algorithms, 10 seeds, ~830 total runs\n\n * 15 benchmark plots (PNG + PDF)\n\n\n\n\nEverything is open:\n\nPaper: https://zenodo.org/records/19075167\n\nCode: https://github.com/karimzn00/ARCUSH_1.0\n\nQuestions :\n\nI’d love feedback on from this community specifically:\n\n * Does the SB3 integration feel clean?\n * Are there environments or algorithms on HF Hub that would make good additions to the benchmark suite?\n\n",
"title": "ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)"
}