Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiesyh2kkbuciv5my7ahfgawwgghk4cntwc5lq5aakwg7poedvyupa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhk3nyljphr2"
  },
  "path": "/t/arcus-h-open-benchmark-for-rl-behavioral-stability-under-stress-built-on-sb3/174387#post_3",
  "publishedAt": "2026-03-21T00:42:18.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Thank you for your feedback. It is genuinely the most detailed I’ve received so far, and the three priority items are well targeted.\n\nOn the Atari train/eval wrapper mismatch: you’re right, this is a real inconsistency. Training uses SB3’s AtariWrapper defaults (frame_skip=4, terminal_on_life_loss=True) while eval uses frame_skip=1 and terminal_on_life_loss=False. My original reasoning was that frame_skip=1 gives finer-grained stress measurement per step, but you’re correct that this changes the effective task and makes the Pong results not directly comparable to standard Atari benchmarks. I’ll fix this in v1.1: train and eval wrappers will match exactly, with stress applied on top.\n\nOn valence inversion semantics: this is the critique I expected, and you’ve framed it precisely. For frozen policies, VI doesn’t affect model.predict() at all; it only affects the logged reward signal. You’re right that it belongs in a separate reward-channel corruption track rather than being presented on equal footing with execution-side stressors. I’ll restructure the stressor taxonomy in the paper revision accordingly.\n\nOn reproducibility metadata: agreed on version pinning and explicit model seeding. I’ll add a requirements-lock.txt and log wrapper stacks and package versions as first-class eval metadata in v1.1.\n\nOn environment additions: MiniGrid FourRooms and LunarLander-v3 are the right first wave, since both offer interpretable discrete behavior where continuity and integrity failures are visible. I’ll hold RecurrentPPO until lstm_states handling in predict() is properly implemented.\n\nOne genuine question back: on the reward-corruption reframing, would you present concept drift and valence inversion as a separate “observation/reward corruption” axis distinct from the “execution-side” axis (RC + TV), or keep all four in one taxonomy with clearer semantic labeling?",
  "title": "ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)"
}
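
The wrapper mismatch described in the post can be illustrated with a small sketch. It is not ARCUS-H code: the `AtariWrapperConfig` class and the `wrapper_mismatches` helper are hypothetical names introduced here, and only the two settings quoted in the post (`frame_skip`, `terminal_on_life_loss`) are modeled. The idea is to keep one shared configuration for training and evaluation and to flag any field where the two differ.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AtariWrapperConfig:
    """Hypothetical container for the two wrapper settings named in the post."""
    frame_skip: int = 4                  # training value quoted in the post
    terminal_on_life_loss: bool = True   # training value quoted in the post


def wrapper_mismatches(train_cfg, eval_cfg):
    """Return {field: (train_value, eval_value)} for every differing field."""
    train, evaluation = asdict(train_cfg), asdict(eval_cfg)
    return {k: (train[k], evaluation[k]) for k in train if train[k] != evaluation[k]}


train_cfg = AtariWrapperConfig()
# Eval settings as described for the current release.
old_eval_cfg = AtariWrapperConfig(frame_skip=1, terminal_on_life_loss=False)
# Planned fix: eval reuses the training configuration unchanged.
fixed_eval_cfg = train_cfg

print(wrapper_mismatches(train_cfg, old_eval_cfg))
# {'frame_skip': (4, 1), 'terminal_on_life_loss': (True, False)}
print(wrapper_mismatches(train_cfg, fixed_eval_cfg))
# {}
```

Under these assumptions, the same dictionary from `asdict(train_cfg)` could be passed as keyword arguments to the wrapper in both pipelines and also written into the eval metadata, so a mismatch would show up in the logs rather than in the results.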