RO · Jun 2

GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

Yizhi Chen, Zhanxiang Cao, Xinyi Peng, Yixiao Zheng, Xiaxi Si, Yiheng Li, Liyun Yan, Keqi Zhu, Xueyun Chen, Shengcheng Fu, Tianyue Zhan, Yufei Jia

arXiv:2606.03240

AI Analysis

For embodied AI researchers, this work addresses the need for geometry-aware spatial alignment in robot manipulation and reports strong results on geometry-critical tasks.

GeoAlign introduces a state-guided spatial alignment architecture for VLA models. It achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, and its ablations support the value of geometry post-training and proprioceptive-state-guided querying.

Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.
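The abstract does not say how the proprioceptive state queries the GEP feature grid. One plausible reading is cross-attention, with the state supplying the queries and the grid cells supplying keys and values. The sketch below illustrates that reading only and is not the authors' implementation: the function name `state_guided_query`, the projection matrices `w_q`, `w_k`, `w_v`, the random weights, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def state_guided_query(state, feature_grid, w_q, w_k, w_v, num_tokens):
    """Hypothetical state-guided querying via cross-attention.

    state:        (S,)       proprioceptive state vector
    feature_grid: (H, W, C)  spatial feature map (stand-in for GEP features)
    w_q:          (S, num_tokens * D)  maps the state to num_tokens queries
    w_k, w_v:     (C, D)     key / value projections for grid cells
    Returns (tokens, attn) with shapes (num_tokens, D) and (num_tokens, H*W).
    """
    h, w, c = feature_grid.shape
    feats = feature_grid.reshape(h * w, c)          # flatten the grid to cells
    d = w_k.shape[1]
    queries = (state @ w_q).reshape(num_tokens, d)  # queries depend on the state
    keys = feats @ w_k
    values = feats @ w_v
    # Each query distributes attention over all grid cells.
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    tokens = attn @ values                          # compact geometry tokens
    return tokens, attn

# Arbitrary illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
S, H, W, C, D, K = 14, 16, 16, 32, 24, 4
state = rng.normal(size=S)
grid = rng.normal(size=(H, W, C))
w_q = rng.normal(size=(S, K * D)) * 0.1
w_k = rng.normal(size=(C, D)) * 0.1
w_v = rng.normal(size=(C, D)) * 0.1

tokens, attn = state_guided_query(state, grid, w_q, w_k, w_v, K)
print(tokens.shape, attn.shape)  # (4, 24) (4, 256)
```

Because the queries are computed from the state, a different state over the same grid yields different attention weights and therefore different tokens. This is one way the tokens could end up "phase-dependent" in the sense the abstract describes.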
