{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreierudrqtbk4vdmmp5jahcg66ntud6rk4ll6kut5ugwbrmyk5dgh2q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mimwm27dezj2"
  },
  "path": "/t/arcus-h-full-evaluation-results-979-200-episodes-51-rl-policies/174942#post_1",
  "publishedAt": "2026-04-03T23:23:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/karimzn00/ARCUSH"
  ],
  "textContent": "We completed a large behavioral stability evaluation of trained RL policies: **979,200 evaluation episodes** across **51 policy configurations**, 12 environments, 8 algorithms, and 8 structured stress schedules. Here are three findings that matter for deployment.\n\n**Finding 1:** Reward explains 5.7% of behavioral stability variance.\nThe primary correlation between ARCUS-H stability scores and normalized reward is r = +0.240 [0.111, 0.354], p = 1.1×10⁻⁴ (n = 255 policy-level observations, 2,550 seed-level). R² = 0.057.\nThe remaining 94.3% of the variance in how a policy behaves under sensor noise, actuator failure, or reward corruption is not captured by its return in clean conditions. 87% of policies rank differently under ARCUS-H than under reward, with a mean rank shift of 74.4 positions.\n\n**Finding 2:** SAC’s entropy objective amplifies sensor fragility.\nSAC has a 92.5% collapse rate under observation noise. TD3 has a 61.0% collapse rate under the identical stressor, with the same environments and the same training budget; both are off-policy actor-critic algorithms.\nThis was first observed in a pilot evaluation on 47 pairs (90.2% for SAC vs. 61.1% for TD3) and is now replicated across 51 pairs and 10 seeds. The mechanism is clear: SAC’s entropy maximization amplifies sensitivity to noisy observations. TD3’s target policy smoothing provides implicit robustness.\nIf you are choosing between SAC and TD3 for a noisy real-world deployment, this matters. Return alone will not tell you.\n\n**Finding 3:** CNN robustness is representation-dependent, not architecture-determined.\nALE/SpaceInvaders-v5 has a 13% collapse rate under observation noise. ALE/Pong-v5 has a 42% collapse rate under the identical stressor. Same CNN architecture. Same AtariPreprocessing + FrameStack wrapper.\nThe difference is learned representation structure. SpaceInvaders requires the CNN to develop distributed, compositional features. Pong can be solved with localized object tracking. 
Different task complexity produces different representation structure, which produces different robustness to pixel noise.\n\nThe implication for sim-to-real: you cannot infer a CNN policy’s sensor robustness from its architecture. You have to measure it.\n\nARCUS-H is **open source**. No retraining required. Works with any Stable-Baselines3 (SB3) policy.\nRun on your SB3 model:\n\n    git clone https://github.com/karimzn00/ARCUSH\n    cd ARCUSH\n\n    python -m arcus.harness_rl.run_eval \\\n        --run_dir path/to/your/model \\\n        --env     HalfCheetah-v4 \\\n        --algo    td3 \\\n        --seeds   0-4 \\\n        --episodes 120 \\\n        --both\n\n    # Atari (add --obs_normalize for stressor symmetry):\n    python -m arcus.harness_rl.run_eval \\\n        --env ALE/Pong-v5 --algo ppo \\\n        --seeds 0-4 --episodes 120 --both --obs_normalize\n\n\nCode and more details: https://github.com/karimzn00/ARCUSH",
  "title": "ARCUS-H: Full Evaluation Results — 979,200 Episodes, 51 RL Policies"
}