GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models
This work is aimed at embodied AI researchers working on robot manipulation. It addresses the need for geometry-aware spatial alignment and reports strong performance gains on geometry-critical tasks.
GeoAlign introduces a state-guided spatial alignment architecture for VLA models, achieving 99.0% on LIBERO, 85.3% on SimplerEnv-Fractal, and 78.8% on real-world ALOHA tasks, demonstrating the value of geometry post-training and proprioceptive-state-guided querying.
Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.
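The abstract does not spell out how the proprioceptive state queries the GEP feature grid. One plausible reading is a cross-attention step in which queries are derived from the state and keys and values come from the flattened grid. The sketch below illustrates that reading only; the function name `state_guided_query`, the projection matrices, and all shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def state_guided_query(state, feature_grid, w_q, w_k, w_v):
    """Hypothetical cross-attention: state-derived queries attend over a feature grid.

    state:        (d_state,) proprioceptive vector
    feature_grid: (H, W, C) geometry features (stand-in for GEP features)
    w_q:          (n_tokens, d_state, d) one query projection per output token
    w_k, w_v:     (C, d) key and value projections
    returns:      tokens (n_tokens, d), attention weights (n_tokens, H*W)
    """
    h, w, c = feature_grid.shape
    flat = feature_grid.reshape(h * w, c)            # (H*W, C)

    # Each output token gets its own query, all computed from the same state.
    queries = np.einsum("s,nsd->nd", state, w_q)     # (n_tokens, d)
    keys = flat @ w_k                                # (H*W, d)
    values = flat @ w_v                              # (H*W, d)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ values, weights


# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
grid = rng.normal(size=(16, 16, 32))     # 16x16 grid, 32 channels
state = rng.normal(size=(8,))            # 8-dimensional robot state
w_q = rng.normal(size=(4, 8, 24)) * 0.1  # 4 geometry tokens of width 24
w_k = rng.normal(size=(32, 24)) * 0.1
w_v = rng.normal(size=(32, 24)) * 0.1

tokens, attn = state_guided_query(state, grid, w_q, w_k, w_v)
print(tokens.shape, attn.shape)
```

Because the queries depend on the state, the same feature grid yields different tokens as the robot moves through a task, which is one way to obtain the compact, phase-dependent geometry tokens the abstract describes.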