ARCUS-H: Open benchmark for RL behavioral stability under stress (built on SB3)

Hugging Face Forums [Unofficial] March 21, 2026

I’m glad that was helpful.

I would not merge concept drift and valence inversion into one shared “observation/reward corruption” axis.

I would use a hierarchical taxonomy:

Recommended taxonomy

1. Perception / input-side stress

Concept drift

2. Execution / control-side stress

Resource constraint
Trust violation

3. Feedback / objective-side stress

Valence inversion

4. Environment / dynamics-side stress (later, if you add it)
- latent dynamics shift
- delay
- actuator lag
- hidden-parameter shift

That structure is the cleanest fit for your case. It also matches how the broader robustness-benchmark literature is organized. Real-World RL Suite separates perturbations on action, observation, and reward channels, while Robust-Gymnasium organizes disruptions across observed state and reward, actions, and the environment. (GitHub)

Why I would not combine CD and VI into one axis

Because they are semantically different in a frozen-policy SB3 benchmark.

Your README defines concept drift as an additive shift applied to the executed observation, s_t^exec = s_t + δ_t. That directly changes the policy input. So CD is a true behavioral stressor for a frozen SB3 policy, because predict() acts on the observation. SB3’s base API defines predict(observation, state=None, episode_start=None, deterministic=False) as getting the policy action from an observation and an optional hidden state. (GitHub)

Valence inversion is different. Your README defines it as r_t^exec = -r_t. For a standard frozen SB3 policy, reward is not an input to predict(). So VI does not perturb action selection in the same direct way. It perturbs the feedback channel and the semantics of logged task success. That makes it important, but different. If you put CD and VI into one shared “corruption” axis, you risk blurring the exact distinction you just clarified. (GitHub)
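
To make the distinction concrete, here is a minimal sketch in plain Python. It is not ARCUS-H or SB3 code: `frozen_policy` and `step_with_stress` are hypothetical names, and the toy threshold policy only stands in for a frozen model whose action depends on the observation alone.

```python
from typing import List, Optional, Tuple

Observation = List[float]


def frozen_policy(obs: Observation) -> int:
    """Hypothetical frozen policy: the action depends on the observation only."""
    return 1 if sum(obs) > 0.0 else 0


def step_with_stress(
    obs: Observation,
    reward: float,
    delta: Optional[Observation] = None,
    invert_reward: bool = False,
) -> Tuple[int, float]:
    """Apply concept drift and/or valence inversion to one transition.

    Concept drift:     s_exec = s + delta  (changes what the policy sees)
    Valence inversion: r_exec = -r         (changes only the logged reward)
    """
    exec_obs = list(obs) if delta is None else [s + d for s, d in zip(obs, delta)]
    action = frozen_policy(exec_obs)  # reward is never an input here
    exec_reward = -reward if invert_reward else reward
    return action, exec_reward


obs, reward = [0.2, -0.1], 1.0
baseline = step_with_stress(obs, reward)
drifted = step_with_stress(obs, reward, delta=[-0.5, 0.0])
inverted = step_with_stress(obs, reward, invert_reward=True)

print(baseline, drifted, inverted)
```

In this toy setup the drifted observation flips the action, while the inverted reward leaves the action untouched and only changes what gets logged.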

So my answer is:

CD belongs with input-side / perception-side stress.
VI belongs with feedback-side / objective-side stress.
They should not be collapsed into one joint axis, except perhaps visually under a very broad umbrella like “non-execution-side perturbations.” Even then, I would keep them as clearly separate sub-axes.

Why RC and TV belong together

Your README defines:

RC as reduced control authority, either attenuating continuous actions or replacing discrete actions with a default action with some probability.
TV as action-execution mismatch, either mixing continuous actions with a matrix/noise or permuting discrete actions. (GitHub)

Both of those act on the action actually executed by the environment, not on the observation seen by the policy and not on the reward signal recorded afterward. So they are naturally the same top-level class: execution-side stress. Real-World RL Suite’s action-delay framing supports this kind of decomposition, because it also treats the action channel as a distinct failure surface. (GitHub)
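
A rough sketch of what "acting on the executed action" means, again in plain Python. The function names and the parameters `scale`, `p`, `default_action`, `permutation`, and `mixing_matrix` are my own placeholders for illustration, not the names ARCUS-H uses, and the noise variant of TV is left out for brevity.

```python
import random
from typing import List, Sequence


def rc_continuous(action: Sequence[float], scale: float) -> List[float]:
    """Resource constraint (continuous): attenuate the intended action."""
    return [scale * a for a in action]


def rc_discrete(action: int, default_action: int, p: float, rng: random.Random) -> int:
    """Resource constraint (discrete): replace the action with a default with probability p."""
    return default_action if rng.random() < p else action


def tv_continuous(action: Sequence[float], mixing_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Trust violation (continuous): mix action dimensions through a matrix."""
    return [sum(w * a for w, a in zip(row, action)) for row in mixing_matrix]


def tv_discrete(action: int, permutation: Sequence[int]) -> int:
    """Trust violation (discrete): execute a permuted action index."""
    return permutation[action]


rng = random.Random(0)
intended = [1.0, -2.0]

attenuated = rc_continuous(intended, scale=0.5)
swapped = tv_continuous([1.0, 2.0], mixing_matrix=[[0.0, 1.0], [1.0, 0.0]])
forced_default = rc_discrete(action=2, default_action=0, p=1.0, rng=rng)
permuted = tv_discrete(action=0, permutation=[2, 0, 1])
```

All four functions take the policy's intended action and return the action the environment would execute. None of them reads or modifies the observation or the reward, which is why RC and TV sit in the same class.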

The framing I would use in the paper

I would keep all four stressors in one benchmark taxonomy, but not as four peers without structure.

Instead, present them like this:

ARCUS-H v1.1 covers three perturbed RL components:

Perception-side: concept drift

Execution-side: resource constraint, trust violation

Feedback-side: valence inversion

Then add one sentence:

A fourth class, environment/dynamics-side perturbation, is reserved for future versions.

That gives you the simplicity of one taxonomy while preserving the semantic distinctions that matter most. It also aligns well with Robust-Gymnasium’s component-based framing and Real-World RL Suite’s separate treatment of action, observation, and reward perturbations. (GitHub)
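
If it helps to encode the taxonomy in the benchmark itself, something like the following would do. This is only a sketch: `StressChannel`, `TAXONOMY`, and `stressors_by_channel` are names I made up, not identifiers from the ARCUS-H repository.

```python
from enum import Enum
from typing import Dict, List


class StressChannel(Enum):
    PERCEPTION = "perception-side"
    EXECUTION = "execution-side"
    FEEDBACK = "feedback-side"
    ENVIRONMENT = "environment/dynamics-side"  # reserved for future versions


# Hypothetical registry: each stressor is assigned to exactly one channel.
TAXONOMY: Dict[str, StressChannel] = {
    "concept_drift": StressChannel.PERCEPTION,
    "resource_constraint": StressChannel.EXECUTION,
    "trust_violation": StressChannel.EXECUTION,
    "valence_inversion": StressChannel.FEEDBACK,
}


def stressors_by_channel(taxonomy: Dict[str, StressChannel]) -> Dict[StressChannel, List[str]]:
    """Group stressor names under their top-level channel."""
    grouped: Dict[StressChannel, List[str]] = {channel: [] for channel in StressChannel}
    for name, channel in taxonomy.items():
        grouped[channel].append(name)
    return grouped


grouped = stressors_by_channel(TAXONOMY)
for channel, names in grouped.items():
    print(f"{channel.value}: {', '.join(names) if names else '(reserved)'}")
```

Keeping the environment channel in the enum with no stressors attached makes the "reserved for future versions" status explicit in results tables without changing the v1.1 set.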

Why this is the best choice for ARCUS-H specifically

This framing helps you in three ways.

First, it makes your benchmark easier to explain:

what the agent sees
what the agent tries to do
what the benchmark says happened

Second, it protects you from the strongest criticism of VI:

VI is still valuable
but now it is clearly labeled as feedback corruption, not a direct execution stressor for frozen policies

Third, it gives you a natural roadmap:

v1.1: perception, execution, feedback
v1.2 or later: environment/dynamics

That is a very maintainable benchmark story.

My direct recommendation

Use this exact top-level split:

Perception-side: CD
Execution-side: RC, TV
Feedback-side: VI
Future environment-side: dynamics shift, delay, etc.

That is cleaner than either of the two alternatives you proposed:

cleaner than a single flat taxonomy
cleaner than a merged “observation/reward corruption” axis

The reason is that observation corruption and reward corruption are not equivalent once the evaluated policy is frozen: only the former changes the input to predict(). (Stable Baselines3 Docs)