{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifauf56sv23leoyl2wjwzqhfee6kofjqemg2xybpcuicdvmn54cfi",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhkj3hdyhaf2"
},
"path": "/t/arcus-h-open-benchmark-for-rl-behavioral-stability-under-stress-built-on-sb3/174387#post_4",
"publishedAt": "2026-03-21T01:41:27.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub",
"Stable Baselines3 Docs"
],
"textContent": "I’m glad that was helpful.\n\n* * *\n\nI would **not** merge concept drift and valence inversion into one shared “observation/reward corruption” axis.\n\nI would use a **hierarchical taxonomy**:\n\n## Recommended taxonomy\n\n### 1. Perception / input-side stress\n\n * **Concept drift**\n\n\n\n### 2. Execution / control-side stress\n\n * **Resource constraint**\n * **Trust violation**\n\n\n\n### 3. Feedback / objective-side stress\n\n * **Valence inversion**\n\n\n\n### 4. Later, if you add it\n\n * **Environment / dynamics-side stress**\n\n * latent dynamics shift\n * delay\n * actuator lag\n * hidden-parameter shift\n\n\n\nThat structure is the cleanest fit for your case. It also matches how the broader robustness-benchmark literature is organized. Real-World RL Suite separates perturbations on action, observation, and reward channels, while Robust-Gymnasium organizes disruptions across observed state and reward, actions, and the environment. (GitHub)\n\n## Why I would not combine CD and VI into one axis\n\nBecause they are **semantically different in a frozen-policy SB3 benchmark**.\n\nYour README defines concept drift as an additive shift applied to the executed observation, `s_t^exec = s_t + δ_t`. That directly changes the policy input. So CD is a true **behavioral stressor** for a frozen SB3 policy, because `predict()` acts on the observation. SB3’s base API defines `predict(observation, state=None, episode_start=None, deterministic=False)` as getting the policy action from an observation and optional hidden state. (GitHub)\n\nValence inversion is different. Your README defines it as `r_t^exec = -r_t`. For a standard frozen SB3 policy, reward is not an input to `predict()`. So VI does not perturb action selection in the same direct way. It perturbs the **feedback channel** and the semantics of logged task success. That makes it important, but different. 
If you put CD and VI into one shared “corruption” axis, you risk blurring the exact distinction you just clarified. (GitHub)\n\nSo my answer is:\n\n * **CD** belongs with **input-side / perception-side** stress.\n * **VI** belongs with **feedback-side / objective-side** stress.\n * They should not be collapsed into one joint axis, except perhaps visually under a very broad umbrella like “non-execution-side perturbations.” Even then, I would keep them as clearly separate sub-axes.\n\n\n\n## Why RC and TV belong together\n\nYour README defines:\n\n * **RC** as reduced control authority, either attenuating continuous actions or replacing discrete actions with a default action with some probability.\n * **TV** as action-execution mismatch, either mixing continuous actions with a matrix/noise or permuting discrete actions. (GitHub)\n\n\n\nBoth of those act on the **action actually executed by the environment**, not on the observation seen by the policy and not on the reward signal recorded afterward. So they are naturally the same top-level class: **execution-side stress**. Real-World RL Suite’s action-delay framing supports this kind of decomposition, because it also treats the action channel as a distinct failure surface. (GitHub)\n\n## The framing I would use in the paper\n\nI would keep **all four stressors in one benchmark taxonomy**, but not as four peers without structure.\n\nInstead, present them like this:\n\n> ARCUS-H v1.1 covers three perturbed RL components:\n>\n> * **Perception-side:** concept drift\n> * **Execution-side:** resource constraint, trust violation\n> * **Feedback-side:** valence inversion\n>\n\n\nThen add one sentence:\n\n> A fourth class, **environment/dynamics-side perturbation**, is reserved for future versions.\n\nThat gives you the simplicity of one taxonomy while preserving the semantic distinctions that matter most. 
It also aligns well with Robust-Gymnasium’s component-based framing and Real-World RL Suite’s separate treatment of action, observation, and reward perturbations. (GitHub)\n\n## Why this is the best choice for ARCUS-H specifically\n\nThis framing helps you in three ways.\n\nFirst, it makes your benchmark easier to explain:\n\n * what the agent **sees**\n * what the agent **tries to do**\n * what the benchmark **says happened**\n\n\n\nSecond, it protects you from the strongest criticism of VI:\n\n * VI is still valuable\n * but now it is clearly labeled as **feedback corruption**, not a direct execution stressor for frozen policies\n\n\n\nThird, it gives you a natural roadmap:\n\n * v1.1: perception, execution, feedback\n * v1.2 or later: environment/dynamics\n\n\n\nThat is a very maintainable benchmark story.\n\n## My direct recommendation\n\nUse this exact top-level split:\n\n * **Perception-side:** CD\n * **Execution-side:** RC, TV\n * **Feedback-side:** VI\n * **Future environment-side:** dynamics shift, delay, etc.\n\n\n\nThat is cleaner than either of the two alternatives you proposed:\n\n * cleaner than a single flat taxonomy\n * cleaner than a merged “observation/reward corruption” axis\n\n\n\nBecause observation corruption and reward corruption are not equivalent once the evaluated policy is frozen. (Stable Baselines3 Docs)",
"title": "ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)"
}