ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)
I’m glad that was helpful.
I would not merge concept drift and valence inversion into one shared “observation/reward corruption” axis.
I would use a hierarchical taxonomy:
Recommended taxonomy
1. Perception / input-side stress
- Concept drift
2. Execution / control-side stress
- Resource constraint
- Trust violation
3. Feedback / objective-side stress
- Valence inversion
4. Environment / dynamics-side stress (later, if you add it)
- latent dynamics shift
- delay
- actuator lag
- hidden-parameter shift
That structure is the cleanest fit for your case. It also matches how the broader robustness-benchmark literature is organized. Real-World RL Suite separates perturbations on action, observation, and reward channels, while Robust-Gymnasium organizes disruptions across observed state and reward, actions, and the environment. (GitHub)
Why I would not combine CD and VI into one axis
Because they are semantically different in a frozen-policy SB3 benchmark.
Your README defines concept drift as an additive shift applied to the executed observation, s_t^exec = s_t + δ_t. That directly changes the policy input. So CD is a true behavioral stressor for a frozen SB3 policy, because predict() acts on the observation. SB3’s base API defines predict(observation, state=None, episode_start=None, deterministic=False) as getting the policy action from an observation and optional hidden state. (GitHub)
Valence inversion is different. Your README defines it as r_t^exec = -r_t. For a standard frozen SB3 policy, reward is not an input to predict(). So VI does not perturb action selection in the same direct way. It perturbs the feedback channel and the semantics of logged task success. That makes it important, but different. If you put CD and VI into one shared “corruption” axis, you risk blurring the exact distinction you just clarified. (GitHub)
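To make the CD versus VI distinction concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: frozen_policy is a hypothetical stand-in for a frozen model's predict(), true_reward is a toy task, and rollout is not ARCUS-H code. It only shows that shifting the observation changes the chosen actions, while negating the reward leaves the actions untouched and flips the logged return.

```python
from typing import List, Tuple


def frozen_policy(observation: List[float]) -> int:
    """Hypothetical frozen policy: maps an observation to an action, never sees reward."""
    return 1 if sum(observation) > 0.0 else 0


def true_reward(observation: List[float], action: int) -> float:
    """Toy task (assumption): action 1 is correct when the TRUE observation sums to > 0."""
    correct = 1 if sum(observation) > 0.0 else 0
    return 1.0 if action == correct else -1.0


def rollout(
    observations: List[List[float]],
    obs_shift: float = 0.0,
    invert_reward: bool = False,
) -> Tuple[List[int], float]:
    """Run the frozen policy over a fixed observation sequence under optional stress."""
    actions: List[int] = []
    logged_return = 0.0
    for obs in observations:
        # Concept drift: s_exec = s + delta, applied to what the policy sees.
        obs_exec = [x + obs_shift for x in obs]
        action = frozen_policy(obs_exec)
        actions.append(action)
        # Valence inversion: r_exec = -r, applied after the action is chosen.
        reward = true_reward(obs, action)
        logged_return += -reward if invert_reward else reward
    return actions, logged_return


observations = [[0.2, 0.1], [-0.3, 0.1], [0.4, -0.1], [-0.2, -0.2]]

baseline = rollout(observations)
drifted = rollout(observations, obs_shift=0.5)
inverted = rollout(observations, invert_reward=True)

print("baseline:", baseline)   # actions [1, 0, 1, 0], return 4.0
print("CD:      ", drifted)    # actions [1, 1, 1, 1], return 0.0
print("VI:      ", inverted)   # actions [1, 0, 1, 0], return -4.0
```

In this toy setup the VI rollout has exactly the same action sequence as the baseline. Only the logged return changes sign, which is the sense in which VI is a feedback-side stressor for a frozen policy.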
So my answer is:
- CD belongs with input-side / perception-side stress.
- VI belongs with feedback-side / objective-side stress.
- They should not be collapsed into one joint axis, except perhaps visually under a very broad umbrella like “non-execution-side perturbations.” Even then, I would keep them as clearly separate sub-axes.
Why RC and TV belong together
Your README defines:
- RC as reduced control authority, either attenuating continuous actions or replacing discrete actions with a default action with some probability.
- TV as action-execution mismatch, either mixing continuous actions with a matrix/noise or permuting discrete actions. (GitHub)
Both of those act on the action actually executed by the environment, not on the observation seen by the policy and not on the reward signal recorded afterward. So they are naturally the same top-level class: execution-side stress. Real-World RL Suite’s action-delay framing supports this kind of decomposition, because it also treats the action channel as a distinct failure surface. (GitHub)
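As a rough illustration of why RC and TV sit on the same channel, the sketch below implements both as pure functions on the intended action. The function names, the scale parameter for attenuation, and the explicit mixing matrix are my assumptions, not the ARCUS-H implementation; the README wording is only "attenuating", "default action with some probability", "mixing", and "permuting".

```python
import random
from typing import List, Sequence


def rc_discrete(action: int, default_action: int, p: float, rng: random.Random) -> int:
    """RC, discrete case: with probability p, execute the default action instead."""
    return default_action if rng.random() < p else action


def rc_continuous(action: Sequence[float], scale: float) -> List[float]:
    """RC, continuous case: attenuate the action by a scale factor (assumed in [0, 1])."""
    return [scale * a for a in action]


def tv_discrete(action: int, permutation: Sequence[int]) -> int:
    """TV, discrete case: the executed action is a fixed permutation of the intended one."""
    return permutation[action]


def tv_continuous(action: Sequence[float], mixing: Sequence[Sequence[float]]) -> List[float]:
    """TV, continuous case: mix action dimensions with a matrix (noise omitted here)."""
    return [sum(m * a for m, a in zip(row, action)) for row in mixing]


rng = random.Random(0)  # seeded so repeated runs give the same replacements

# In every case the policy's intended action is the input, and only the
# executed action differs. Observation and reward are never touched.
print(rc_discrete(2, default_action=0, p=1.0, rng=rng))      # always the default: 0
print(rc_continuous([1.0, -2.0], scale=0.5))                 # [0.5, -1.0]
print(tv_discrete(0, permutation=[1, 2, 0]))                 # 1
print(tv_continuous([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]]))   # [0.0, 1.0]
```

All four functions take the intended action and return the executed action, which is the shared signature that justifies grouping them as execution-side stress.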
The framing I would use in the paper
I would keep all four stressors in one benchmark taxonomy, but not as four peers without structure.
Instead, present them like this:
ARCUS-H v1.1 covers three perturbed RL components:
- Perception-side: concept drift
- Execution-side: resource constraint, trust violation
- Feedback-side: valence inversion
Then add one sentence:
A fourth class, environment/dynamics-side perturbation, is reserved for future versions.
That gives you the simplicity of one taxonomy while preserving the semantic distinctions that matter most. It also aligns well with Robust-Gymnasium’s component-based framing and Real-World RL Suite’s separate treatment of action, observation, and reward perturbations. (GitHub)
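If it helps to encode this in the benchmark itself, one possible representation is a small mapping from stressor to perturbed component. This is a sketch only: the Channel enum, the TAXONOMY dict, and the string keys are hypothetical names I am introducing, not identifiers from the ARCUS-H repository.

```python
from enum import Enum
from typing import Dict, List


class Channel(Enum):
    """Top-level perturbed component (hypothetical naming)."""
    PERCEPTION = "perception"
    EXECUTION = "execution"
    FEEDBACK = "feedback"
    ENVIRONMENT = "environment"  # reserved for future versions


# Each stressor belongs to exactly one top-level channel.
TAXONOMY: Dict[str, Channel] = {
    "concept_drift": Channel.PERCEPTION,
    "resource_constraint": Channel.EXECUTION,
    "trust_violation": Channel.EXECUTION,
    "valence_inversion": Channel.FEEDBACK,
}


def stressors_by_channel(taxonomy: Dict[str, Channel]) -> Dict[Channel, List[str]]:
    """Group stressor names under their channel, keeping empty channels visible."""
    grouped: Dict[Channel, List[str]] = {channel: [] for channel in Channel}
    for name, channel in taxonomy.items():
        grouped[channel].append(name)
    return grouped


for channel, names in stressors_by_channel(TAXONOMY).items():
    print(f"{channel.value}: {names if names else '(reserved)'}")
```

Keeping the empty environment channel in the output makes the roadmap explicit in reports without implying that any dynamics-side stressor exists yet.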
Why this is the best choice for ARCUS-H specifically
This framing helps you in three ways.
First, it makes your benchmark easier to explain:
- what the agent sees
- what the agent tries to do
- what the benchmark says happened
Second, it protects you from the strongest criticism of VI:
- VI is still valuable
- but now it is clearly labeled as feedback corruption, not a direct execution stressor for frozen policies
Third, it gives you a natural roadmap:
- v1.1: perception, execution, feedback
- v1.2 or later: environment/dynamics
That is a very maintainable benchmark story.
My direct recommendation
Use this exact top-level split:
- Perception-side: CD
- Execution-side: RC, TV
- Feedback-side: VI
- Future environment-side: dynamics shift, delay, etc.
That is cleaner than either of the two alternatives you proposed:
- cleaner than a single flat taxonomy
- cleaner than a merged “observation/reward corruption” axis
The reason is that observation corruption and reward corruption are not equivalent once the evaluated policy is frozen. (Stable Baselines3 Docs)