Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
This work targets the robustness and generalizability of Vision-Language-Action (VLA) systems in embodied intelligence, though it appears incremental because it builds on existing Latent Action Model (LAM) frameworks.
The paper tackles two bottlenecks of Latent Action Models (LAMs) in Vision-Language-Action (VLA) systems: poor spatial understanding and limited temporal perception. It proposes Farsighted-LAM, which combines geometry-aware spatial encoding with multi-scale temporal modeling, and SSM-VLA, an end-to-end VLA framework built on Farsighted-LAM that is reported to achieve state-of-the-art performance on multiple VLA tasks.
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.