Beyond ReconVLA: Annotation-Free Visual Grounding via Language-Attention Masked Reconstruction

Hugging Face Forums [Unofficial] March 17, 2026

Thanks for the thoughtful response. This is exactly the kind of critique I was hoping to surface before running the experiments.

Your point about attention maps being unreliable early in training is a very real concern. Cross-attention is often poorly calibrated until the model has already learned some degree of language–vision alignment. If we immediately treat those maps as gaze proxies, the masking policy could end up chasing noise rather than meaningful spatial structure.

I think your suggestion of introducing some form of stabilization phase is a very sensible direction. One approach that seems promising is a simple curriculum:

Phase 1: Representation warm-up. Start with standard MAE-style random masking so the encoder learns basic spatial and geometric structure without relying on language alignment.

Phase 2: Hybrid masking. Gradually mix random masking with attention-derived masking. The idea would be to slowly introduce instruction-aware reconstruction pressure once the encoder begins forming usable visual features.

Phase 3: Full LA-ReconVLA masking. Once cross-attention patterns become more concentrated, rely primarily on attention-selected patches as reconstruction targets.

The goal would be to avoid the early-training instability while still letting the model eventually benefit from instruction-conditioned masking.
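
To make the three phases concrete, here is a rough sketch of how the schedule could be expressed. It is illustration only, not an implementation: the function names, the linear ramp in Phase 2, and the `warmup_steps` / `hybrid_steps` parameters are all assumptions of mine, and Phase 3 is simplified to fully attention-selected masking.

```python
import random


def attention_mask_fraction(step, warmup_steps, hybrid_steps):
    """Share of masked patches chosen by attention at a given training step.

    Hypothetical schedule: 0.0 during warm-up (Phase 1), a linear ramp
    during hybrid masking (Phase 2), and 1.0 afterwards (Phase 3).
    """
    if step < warmup_steps:
        return 0.0
    if step < warmup_steps + hybrid_steps:
        return (step - warmup_steps) / hybrid_steps
    return 1.0


def select_masked_patches(attn, num_masked, attn_fraction, rng):
    """Pick patch indices to mask for one image.

    attn is a list of per-patch attention scores. A share attn_fraction of
    the masked patches are the highest-attention ones; the remainder are
    drawn uniformly at random from the patches not already chosen.
    """
    n_attn = round(num_masked * attn_fraction)
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    chosen = ranked[:n_attn]
    chosen += rng.sample(ranked[n_attn:], num_masked - n_attn)
    return sorted(chosen)


# Example: six patches, mask three of them halfway through Phase 2.
rng = random.Random(0)
frac = attention_mask_fraction(step=200, warmup_steps=100, hybrid_steps=200)
mask = select_masked_patches([0.1, 0.5, 0.2, 0.9, 0.05, 0.3], 3, frac, rng)
```

A linear ramp is just the simplest choice here; the ramp could equally be tied to a measured property of the attention maps rather than to the step count.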

Your comment also made me think about temporal stability, which might be another useful signal in manipulation datasets. Since trajectories are continuous, the true interaction region tends to persist across nearby frames. Instead of relying on a single frame’s attention map, we could aggregate attention over a short temporal window (for example, a few adjacent observations in the trajectory). That might suppress a lot of the early noise and produce a more stable proxy for the “task-relevant region.”
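
As a sketch of what that aggregation might look like, assuming each frame's attention map has already been flattened to a list of per-patch scores and that a plain unweighted mean is good enough (both are assumptions, and `aggregate_attention` is a hypothetical helper):

```python
def aggregate_attention(window):
    """Average per-patch attention over a short window of adjacent frames.

    window is a list of frames, each a list of per-patch attention scores
    of equal length. The result is renormalized to sum to 1; if every
    score is zero, a uniform map is returned instead.
    """
    num_patches = len(window[0])
    mean = [
        sum(frame[i] for frame in window) / len(window)
        for i in range(num_patches)
    ]
    total = sum(mean)
    if total == 0:
        return [1.0 / num_patches] * num_patches
    return [m / total for m in mean]


# Example: two adjacent frames over two patches.
smoothed = aggregate_attention([[0.5, 0.5], [1.0, 0.0]])
```

This treats patch positions as directly comparable across frames, which only holds when the camera and scene move little within the window.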

Another small mechanism I am considering is confidence gating on the attention distribution. If the attention map is too diffuse (high entropy), the model could fall back to random masking for that sample. Only when attention becomes sufficiently concentrated would it be trusted as a masking signal. In practice this might prevent the system from reinforcing spurious alignments during the early stages of training.
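
A minimal sketch of such a gate, assuming entropy is normalized by its maximum (the log of the patch count) so that a single threshold works across image resolutions. The threshold value of 0.8 is an arbitrary placeholder, and both function names are hypothetical:

```python
import math


def normalized_entropy(attn):
    """Shannon entropy of an attention map, scaled to the range [0, 1].

    attn is a list of non-negative per-patch scores with a positive sum
    and at least two entries. 1.0 means fully diffuse (uniform), 0.0
    means all mass on a single patch.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def use_attention_mask(attn, max_entropy=0.8):
    """Return True if the map is concentrated enough to trust for masking.

    When this returns False, the caller would fall back to random masking
    for that sample. The 0.8 default is a placeholder, not a tuned value.
    """
    return normalized_entropy(attn) <= max_entropy


# Example: a diffuse map is rejected, a peaked map is accepted.
diffuse_ok = use_attention_mask([0.25, 0.25, 0.25, 0.25])
peaked_ok = use_attention_mask([0.97, 0.01, 0.01, 0.01])
```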

None of this is implemented yet. Right now the goal is just to test the core hypothesis on a very small scale (LIBERO-Spatial, a few tasks, ~50 demos each) and see whether the reconstruction objective actually improves spatial grounding at all. If the signal shows up even at that scale, then these stabilization strategies become worth exploring more seriously.

Really appreciate you taking the time to think through the idea. Feedback like this helps sharpen the experiment design a lot before burning compute on runs.
