Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiamed25z3va5mbdz5emvyu4t3nzwxgskoz7ae2krbcjr7m6fdtob4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mn2mkjl2sk42"
  },
  "path": "/t/physical-modelling-of-sim2real-so101-arm-project/176368#post_2",
  "publishedAt": "2026-05-30T06:43:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "the LeRobot Discord",
    "NVIDIA Real Evaluation docs",
    "Real Evaluation docs",
    "NVIDIA troubleshooting guide",
    "NVIDIA Datasets and Models page",
    "Datasets and Models page",
    "Co-Training section",
    "Cosmos augmentation section",
    "SAGE + GapONet section",
    "NVIDIA/Isaac-GR00T #367: LoRA finetune bad performance?",
    "NVIDIA/Isaac-GR00T #298: The open-loop testing and actual machine performance are very poor",
    "NVIDIA/Isaac-GR00T #285: Robot arm stuttering during action execution",
    "StoneT2000/lerobot-sim2real #9: Sim2Real Problem for SO101",
    "huggingface/lerobot #3345: Eye To Hand Calibration Using LeRobot",
    "huggingface/lerobot #2413: so100_to_so100_EE / recalibration / URDF question"
  ],
  "textContent": "Personally, given the topic area, I would probably also suggest that you ask in the LeRobot Discord, since people there may have more hands-on SO-101 / camera / calibration experience. But before doing that, I think it may help to organize the report a bit so that others can reproduce or diagnose it more easily:\n\n* * *\n\nI have not run your exact setup myself, so please treat this as a practical debugging / reproducibility checklist rather than a diagnosis. But based on your description, I would not reduce this to only “the physical modeling is wrong” yet.\n\nA more useful way to frame it may be:\n\n> Is this mainly a **workspace / visual distribution issue**, a **camera / calibration issue**, an **expected limitation of the sim-only checkpoint**, or a lower-level **actuation / backlash / physical-modeling issue**?\n\nYour update that success improves when the physical rack / vial / mat placement is made closer to the sim / dataset-like setup is especially informative. If that observation is reproducible, it suggests that the policy may be quite sensitive to the reference workspace geometry and camera observations.\n\n## 1. 
First clarify the exact checkpoint\n\nThe NVIDIA Real Evaluation docs appear to use this sim-only checkpoint for real evaluation:\n\n\n    export MODEL=aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left/checkpoint-10000\n\n\nSo I would explicitly confirm whether the run used exactly `checkpoint-10000`, or whether it used another checkpoint such as `checkpoint-1000`, `checkpoint-100005`, a local checkpoint, or a later training artifact.\n\nThis matters because otherwise it is hard to compare:\n\n  * sim evaluation,\n  * real evaluation,\n  * the documented tutorial,\n  * and other users’ results.\n\n\n\nUseful information to add:\n\n\n    Model repo:\n    <model repo>\n\n    Checkpoint actually loaded by the GR00T server:\n    <checkpoint path>\n\n    Server log showing the loaded model:\n    <server log excerpt>\n\n    Was this exactly the documented checkpoint-10000?\n    <yes/no/unclear>\n\n\nIf the checkpoint differs from the documented one, the result may still be useful, but it becomes a different comparison.\n\n## 2. Confirm camera assignment with actual frames\n\nThe same Real Evaluation docs mention using Rerun so that you can inspect joint actions and camera feeds while the policy runs. 
The NVIDIA troubleshooting guide also calls out camera index changes, wrong camera feed assignment, and camera positioning as possible causes of deployment problems.\n\nSo I would add concrete camera evidence, not just “the cameras are detected.”\n\nFor example:\n\n\n    lerobot-find-cameras opencv\n\n\nThen include:\n\n\n    Camera detection output:\n    <output>\n\n    CAMERA_GRIPPER:\n    <index>\n\n    CAMERA_EXTERNAL:\n    <index>\n\n    One frame from wrist camera:\n    <link or image>\n\n    One frame from front camera:\n    <link or image>\n\n    Rerun screenshot while policy is running:\n    <link or image>\n\n\nThings that would be worth checking:\n\n  * Are `front` and `wrist` definitely not swapped?\n  * Is the wrist camera image oriented as expected?\n  * Is the external camera seeing roughly the same workspace composition as the dataset visualizer?\n  * Are the OpenCV camera indices stable after unplug / replug?\n  * Is the camera physically fixed and not vibrating?\n  * Are focus, exposure, brightness, and white balance stable?\n  * Are the camera views 640x480 as expected by the tutorial command?\n  * Does Rerun show reasonable joint actions and camera feeds during the rollout?\n\n\n\nA camera mismatch can easily produce a situation where the policy runs without a runtime error but fails behaviorally.\n\n## 3. Quantify the workspace geometry\n\nYour observation that real success improves after matching the physical layout more closely seems important. 
I would make this part quantitative.\n\nInstead of only describing the setup in photos, it would help to provide robot-base-relative or mat-relative measurements.\n\nFor example:\n\nItem | Measurement\n---|---\nRobot base center → mat corner | <x mm, y mm>\nRobot base center → rack center | <x mm, y mm>\nRack yaw |\nRobot base center → vial initial position | <x mm, y mm>\nVial yaw |\nExternal camera position | <x, y, z, yaw, pitch, roll if known>\nWrist camera mount | <photo / approximate angle>\nLight brightness / color temperature |\nCamera exposure / focus |\nMat / rack / vial dimensions |\n\nThe practical question is:\n\n> How narrow is the successful region in workspace coordinates?\n\nIf success only appears when the rack / vial / mat layout is very close to the dataset visualizer or simulated reference setup, then the dominant issue may be workspace / visual distribution sensitivity rather than just physical dynamics.\n\nI would phrase that cautiously:\n\n> If the reported improvement after matching rack / vial / mat placement is reproducible, that suggests the policy may be quite sensitive to the reference workspace geometry and camera observations. I would not conclude “bad physical modeling” until checkpoint identity, camera assignment, camera pose, and workspace geometry are ruled out.\n\n## 4. Add a failure-mode table\n\n“0% success” is useful, but it is hard to debug without knowing how the failures look. 
A failure-mode table would help others reason about the cause.\n\nFor example:\n\nFailure mode | Count | Possible interpretation\n---|---|---\nDoes not move toward vial |  | Camera / language / action interface issue\nReaches near vial but laterally offset |  | Camera pose / workspace geometry / calibration issue\nGrasps but drops vial |  | Gripper / friction / timing issue\nGrasps vial but misses rack |  | Rack pose / precision / actuation issue\nAlways misses by the same offset |  | Calibration / camera positioning / kinematic offset\nRandom-looking failures |  | Visual instability / distribution shift\nStuttering / jerky execution |  | Action execution / latency / chunking issue\nSucceeds only after geometry tuning |  | Narrow workspace / visual distribution\n\nThis would be especially useful if you can attach a few short clips or frame sequences for representative failures.\n\n## 5. Compare the available policy variants\n\nThe NVIDIA Datasets and Models page lists several relevant model variants:\n\n  * sim-only model: `aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left`\n  * sim+real model: `aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left_sim_and_real`\n  * Cosmos-augmented models:\n    * `aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_02`\n    * `aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_10`\n\n\n\nA useful diagnostic experiment would be to run the same physical setup with multiple checkpoints:\n\nPolicy | Same physical geometry? 
| Result\n---|---|---\nsim-only checkpoint | yes | <success / trials>\nsim-only checkpoint, reference-like geometry | yes | <success / trials>\nsim+real checkpoint | yes | <success / trials>\nCosmos 7 checkpoint | yes | <success / trials>\nCosmos 70 checkpoint | yes | <success / trials>\n\nPossible interpretations:\n\n  * **Only reference-like geometry works**\n→ likely narrow workspace / initial-condition / visual distribution sensitivity.\n\n  * **Cosmos improves performance**\n→ likely visual variation, lighting, texture, or object-position variation matters.\n\n  * **sim+real improves performance**\n→ real-world grounding is important; sim-only may simply be too weak for robust zero-shot transfer.\n\n  * **All policies miss by the same spatial offset**\n→ calibration, camera pose, or kinematic offset becomes more likely.\n\n  * **All policies stutter or pause**\n→ action execution, latency, or chunk-boundary behavior may need attention.\n\n  * **All policies fail despite correct cameras and geometry**\n→ then deeper actuation / calibration / physical-model mismatch becomes more plausible.\n\n\n\n\nThis comparison would be more useful than only asking whether the sim-only policy “should work.”\n\n## 6. Clarify the expected real-world baseline\n\nOne documentation question seems worth asking directly:\n\n> What real-world success rate should users expect from the documented sim-only checkpoint under the reference SO-101 setup?\n\nThe Real Evaluation docs explain how to run the real robot and inspect Rerun, but it would be useful to know whether near-zero real success is expected for the sim-only checkpoint outside a narrow reference setup, or whether it indicates a setup problem.\n\nRelated documentation questions:\n\n  1. Is `checkpoint-10000` the canonical checkpoint for real evaluation?\n  2. Was the real-evaluation video / baseline success rate measured internally?\n  3. Can the authors share a short successful real rollout video?\n  4. 
Can the authors share reference front/wrist camera frames?\n  5. Can the authors share approximate reference measurements for robot base, mat, rack, vial initial position, and external camera?\n  6. Is the sim-only checkpoint intended mainly as a baseline before sim+real co-training / Cosmos augmentation?\n  7. Are camera exposure, focus, and white balance expected to be fixed?\n\n\n\nThere is also a small dataset/model-count discrepancy that may be worth asking about. The Datasets and Models page describes the sim+real dataset as **75 sim-only demonstrations + 5 real-world demonstrations**, while the model table describes the sim+real model as **75 sim + 50 real**. It would be useful to clarify which number is correct, because the amount of real data strongly affects the expected real-world behavior.\n\n## 7. Co-training may be the intended next step\n\nThe Co-Training section describes combining simulation data with real-world demonstrations, including small real datasets such as 5 real episodes.\n\nSo I would not assume that the sim-only checkpoint is expected to be robust zero-shot across copied physical setups. It may be better interpreted as a baseline for observing the sim-to-real gap before trying:\n\n  * sim+real co-training,\n  * Cosmos augmentation,\n  * or actuation-gap modeling.\n\n\n\nA useful question for the maintainers would be:\n\n> Is the expected workflow that users first observe the sim-to-real gap with the sim-only checkpoint, and then move to sim+real / Cosmos / SAGE? Or should the sim-only checkpoint already achieve a meaningful real-world success rate in the reference workspace?\n\n## 8. 
Cosmos comparison can test visual / workspace-distribution sensitivity\n\nThe Cosmos augmentation section discusses augmenting data with visual variations such as lighting, object position, textures, and environmental changes.\n\nThat makes the Cosmos checkpoints useful as a diagnostic tool here.\n\nSuggested comparison:\n\n\n    Same camera placement:\n    <yes/no>\n\n    Same rack / vial / mat placement:\n    <yes/no>\n\n    sim-only success:\n    <n>/<N>\n\n    Cosmos-7 success:\n    <n>/<N>\n\n    Cosmos-70 success:\n    <n>/<N>\n\n    Failure-mode differences:\n    <short notes>\n\n\nIf Cosmos helps, the issue is probably not only actuator physics. It would suggest that visual / workspace distribution is a major factor.\n\nIf Cosmos does not help, but sim+real helps, then real-world grounding may be more important than synthetic visual variation.\n\nIf neither helps and the failure is spatially consistent, calibration / camera pose / actuation gap becomes more likely.\n\n## 9. Actuation / backlash is still possible, but I would check it later\n\nThe SAGE + GapONet section discusses sim-to-real actuation gaps and notes that SO-101 hobby servos can introduce backlash that accumulates through the kinematic chain.\n\nSo actuation gap is definitely a real possibility.\n\nBut I would check it after:\n\n  1. exact checkpoint,\n  2. camera assignment,\n  3. camera pose / focus / exposure,\n  4. reference-like workspace geometry,\n  5. failure-mode consistency,\n  6. sim-only vs sim+real vs Cosmos behavior.\n\n\n\nIf the robot always misses by the same spatial offset even when the camera views and workspace geometry are correct, then calibration / camera pose / actuation gap becomes much more likely.\n\n## 10. Related reports, but not necessarily the same root cause\n\nThere are a few related community reports that may be worth reading. 
I would not assume they have the same root cause, but they show that SO-101 / GR00T / LeRobot real deployment can be sensitive to grasping, calibration, camera setup, and execution details.\n\n  * NVIDIA/Isaac-GR00T #367: LoRA finetune bad performance?\nReports poor real-robot grasping and severe shaking on a LeRobot SO101 Dual Arm after GR00T LoRA fine-tuning.\n\n  * NVIDIA/Isaac-GR00T #298: The open-loop testing and actual machine performance are very poor\nReports poor open-loop and real-machine performance after fine-tuning on LeRobot S101 / SO101-style data.\n\n  * NVIDIA/Isaac-GR00T #285: Robot arm stuttering during action execution\nReports stuttering / jerky motion between action segments on SO100 / SO101.\n\n  * StoneT2000/lerobot-sim2real #9: Sim2Real Problem for SO101\nReports SO101 real-world grasp failure and discusses possible hardware differences, lighting, shadows, warnings, and end-effector jitter.\n\n  * huggingface/lerobot #3345: Eye To Hand Calibration Using LeRobot\nReports significant eye-to-hand calibration errors with SO-101 and a fixed USB-RGB camera.\n\n  * huggingface/lerobot #2413: so100_to_so100_EE / recalibration / URDF question\nDiscusses mismatch between real joint actions and EE / IK-derived actions.\n\n\n\n\nAgain, these are not proof that this issue has the same cause. They are just useful context.\n\n## 11. If you also ask in LeRobot Discord\n\nFor a LeRobot Discord follow-up, I would make the question short and evidence-heavy. 
The goal would not be to repost the whole thread, but to ask whether other SO-101 users can compare their working setup against yours.\n\nSomething like this might be easier for people to answer:\n\n\n    Has anyone reproduced the NVIDIA SO-101 sim-to-real real evaluation with the provided GR00T sim-only checkpoint?\n\n    I am trying to distinguish between:\n    - camera feed / camera assignment issues,\n    - workspace geometry or initial-condition distribution shift,\n    - SO-101 calibration / backlash,\n    - and the expected limitation of the sim-only checkpoint.\n\n    The interesting observation is that real success improves when the rack / vial / mat positions are manually matched more closely to the sim / dataset-like layout.\n\n    Could anyone who has run this share:\n    1. exact checkpoint used,\n    2. real success rate,\n    3. front/wrist camera frames,\n    4. external camera pose,\n    5. rack/mat/robot-base measurements,\n    6. whether sim+real or Cosmos checkpoints worked better?\n\n\nThe most useful attachments would probably be:\n\n  * one front-camera frame,\n  * one wrist-camera frame,\n  * one Rerun screenshot,\n  * one top-down setup photo,\n  * a small table of workspace measurements,\n  * success / failure counts,\n  * and a short failure-mode table.\n\n\n\n## 12. 
Compact information block to add to this thread\n\nIf possible, I would add a compact block like this to the original post:\n\n\n    Exact checkpoint:\n    <checkpoint path>\n\n    GR00T / workshop repo commit:\n    <commit>\n\n    LeRobot version / commit:\n    <version or commit>\n\n    Docker image / tag:\n    <tag>\n\n    Sim eval:\n    <n>/<N> success\n\n    Real eval before geometry adjustment:\n    <n>/<N> success\n\n    Real eval after geometry adjustment:\n    <n>/<N> success\n\n    Camera detection:\n    <lerobot-find-cameras opencv output>\n\n    Front camera frame:\n    <link>\n\n    Wrist camera frame:\n    <link>\n\n    Rerun screenshot:\n    <link>\n\n    Robot-base-relative workspace measurements:\n    <measurements>\n\n    Main failure modes:\n    <counts and notes>\n\n    Other checkpoints tested:\n    <sim+real / Cosmos / none>\n\n\nThat would make the question much easier to answer for both LeRobot users and NVIDIA / GR00T maintainers.\n\n## My current read\n\nBased only on the description, I would not jump straight to “the physical model is wrong.”\n\nThe reported improvement after matching the physical layout more closely seems more consistent with one or more of:\n\n  * narrow workspace / initial-condition distribution,\n  * camera pose or camera assignment mismatch,\n  * visual distribution shift,\n  * calibration offset,\n  * and only later, actuation / backlash / physical-modeling gaps.\n\n\n\nThe most useful next step is probably a small reproducibility package:\n\n  1. exact checkpoint,\n  2. camera evidence,\n  3. workspace measurements,\n  4. failure taxonomy,\n  5. and a comparison between sim-only / sim+real / Cosmos checkpoints under the same physical setup.\n\n\n\nThat would also make a LeRobot Discord follow-up much more actionable.",
  "title": "Physical Modelling of sim2real SO101 Arm Project"
}