{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreief5nmbfwjtgxmgkmy3pus6jpkv4erkhwtb6bz7s3rni2j7yoald4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhhl4ulelht2"
},
"path": "/t/arcus-h-open-benchmark-for-rl-behavioral-stability-under-stress-built-on-sb3/174387#post_2",
"publishedAt": "2026-03-20T00:22:41.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub",
"Hugging Face"
],
"textContent": "for now, quick feedback:\n\n* * *\n\nHere is a paste-ready maintainer-style review. Remove the citations before posting if you want a cleaner version.\n\nI took a close look at the public repo and docs, and the short version is: **this feels like a real SB3-native benchmark, not a custom RL fork**. The overall shape is strong. It stays inside the Stable-Baselines3 and Gymnasium workflow, keeps the agent interface standard, and adds a separate stress-evaluation layer with a clear PRE → SHOCK → POST protocol. That is a good design choice for adoption because people can understand it immediately if they already use SB3, RL Zoo, or the SB3 Hugging Face models. ARCUS-H’s current public scope also looks benchmark-sized rather than anecdotal: 9 environments, 7 algorithms, 4 stress schedules, 10 seeds, and 120-episode evaluation runs. (GitHub)\n\nOn the specific question of whether the **SB3 integration feels clean**, my answer is: **yes at the user-facing level, but not fully yet at the benchmark-hygiene level**. The training path is clearly aligned with normal SB3 practice. The code auto-selects `CnnPolicy`, `MultiInputPolicy`, or `MlpPolicy` from the observation space, uses SB3 core algorithms plus `sb3-contrib` for TRPO, and relies on standard env utilities rather than custom training logic. That is exactly the kind of interface the SB3 community tends to trust. (GitHub)\n\nThe main technical cleanup I would make before expanding the benchmark is **Atari train/eval symmetry**. In the current public code, the train path uses SB3’s Atari env utilities, while the eval path manually wraps Atari with `AtariPreprocessing(..., frame_skip=1, terminal_on_life_loss=False, ...)` and frame stacking. SB3’s Atari wrapper defaults are materially different: `frame_skip=4`, `terminal_on_life_loss=True`, and clipped rewards by default. On Atari, those are not cosmetic differences. They can change the effective task and the meaning of the evaluation result. 
I would make training and evaluation wrapper stacks match exactly by default, then apply ARCUS-H stress on top. (GitHub)\n\nThe second thing I would tighten is **metric semantics around reward corruption**. ARCUS-H’s valence inversion stressor flips reward during SHOCK, but the evaluator is still driving fixed SB3 policies through `model.predict(obs, deterministic=...)`. In standard SB3 inference, action selection is observation-driven; reward is not an input to `predict()`. So for frozen policies, reward inversion is not on the same footing as action attenuation, action permutation, or observation drift. It is still a useful track, but I would probably present it as a **reward-channel corruption track** rather than mix it directly with execution-side stressors in the strongest headline claim. That would make the benchmark story cleaner and preempt an obvious criticism. (GitHub)\n\nI would also do one pass on **reproducibility and packaging**. The repo README currently advertises Python 3.9+, while SB3’s current stable docs say 2.7.1 is the last release supporting Python 3.9 and recommend Python 3.10 or newer. SB3’s reproducibility guidance also explicitly says that deterministic results on a fixed setup require passing a `seed` when creating the model, and that exact reproducibility is still not guaranteed across platforms or PyTorch versions. For a benchmark, that means version pinning, explicit model seeding, and logging wrapper stacks and package versions are worth treating as first-class metadata. (GitHub)\n\nOn the second question, **there are definitely good additions on the Hugging Face Hub**, and I would add them in waves rather than all at once. My first picks would be **MiniGrid FourRooms**, **MiniGrid Unlock**, **LunarLander-v3**, and **QR-DQN on Acrobot-v1**. FourRooms and Unlock are good because they add longer-horizon, interpretable discrete behavior where continuity and integrity failures are easier to see than in some classic-control tasks. 
LunarLander-v3 is a strong benchmark choice because Gymnasium explicitly notes that v3 fixed reset determinism and episode-to-episode wind independence issues. QR-DQN on Acrobot is the cleanest algorithm-side addition because it lets you test a stronger discrete off-policy baseline without changing the environment family at the same time. All of these already exist in the SB3 organization on the Hub. (Hugging Face)\n\nMy second wave would be **BipedalWalkerHardcore** and **TQC**. BipedalWalkerHardcore is a good stress-test environment because the harder terrain creates richer degradation modes than simpler continuous-control tasks. TQC is the most informative next continuous-control algorithm because it is an SB3-Contrib method designed to improve over SAC-style critic behavior, which makes it especially relevant if the benchmark is already finding that high nominal reward can coexist with brittle stress behavior. The SB3 Hub also has TQC models for robotics tasks, but I would only add those after the evaluator is ready for more complex observation structures. (Hugging Face)\n\nI would leave **RecurrentPPO** and the **Panda robotics tasks** for later. They are good additions, but only after the evaluator explicitly supports recurrent state handling and the observation-side stress logic is ready for more structured inputs. SB3-Contrib’s own docs are very clear that recurrent inference needs `lstm_states` and `episode_start` passed into `predict()` correctly, so I would not put recurrent models on the main leaderboard until that path is implemented cleanly. (Hugging Face)\n\nSo the overall maintainer-style take is: **the benchmark idea is strong, the SB3 integration is mostly clean, and the project feels worth following**. The main things I would fix before broadening the suite are:\n\n 1. make Atari train/eval wrappers identical,\n 2. separate reward-corruption semantics from execution-side stress in the main story, and\n 3. harden reproducibility metadata and dependency pinning.\n\nAfter that, I would expand first with MiniGrid, LunarLander-v3, and QR-DQN, then add harder continuous-control and memory-dependent models in later benchmark versions. (GitHub)\n\n",
"title": "ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)"
}