OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
OASIS addresses the misalignment between observation and action spaces in robotic manipulation, improving policy success rate and out-of-distribution generalization.
OASIS aligns the intermediate representations of a visuomotor policy with the action space via SE(3) end-effector trajectory prediction, outperforming VLA and WAM baselines in success rate and out-of-distribution generalization on both simulated and real-world robotic manipulation tasks.
Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.
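To make the notion of a camera-frame $SE(3)$ end-effector trajectory concrete, the following minimal sketch expresses a sequence of end-effector poses, given in the robot base frame, in the camera frame using homogeneous transforms. It is an illustration under stated assumptions, not the OASIS implementation: the abstract does not specify how trajectory targets are constructed, and the helper names (`make_pose`, `invert_pose`, `to_camera_frame`), the yaw-only rotations, and the numeric values are all hypothetical. Only the Python standard library is used.

```python
import math
from typing import List, Sequence

Pose = List[List[float]]  # 4x4 homogeneous transform, row-major (assumed layout)


def matmul(a: Pose, b: Pose) -> Pose:
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw: float, t: Sequence[float]) -> Pose:
    """Build a pose with a rotation of `yaw` radians about z and translation t.

    Yaw-only rotation is a simplification for this example; a general
    SE(3) pose would carry an arbitrary 3x3 rotation matrix.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, t[0]],
            [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def invert_pose(T: Pose) -> Pose:
    """Closed-form SE(3) inverse: [R | t]^-1 = [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def to_camera_frame(T_base_cam: Pose, traj_base: List[Pose]) -> List[Pose]:
    """Express base-frame end-effector poses in the camera frame.

    T_base_cam is the camera pose in the base frame (assumed known from
    extrinsic calibration). Each output pose is T_cam_ee = T_base_cam^-1 @ T_base_ee.
    """
    T_cam_base = invert_pose(T_base_cam)
    return [matmul(T_cam_base, T) for T in traj_base]


# Hypothetical setup: camera rotated 90 degrees about z, offset from the base.
T_base_cam = make_pose(math.pi / 2, (1.0, 0.0, 0.5))
# Hypothetical end-effector trajectory: three poses moving along base-frame y.
traj_base = [make_pose(0.0, (1.0, 0.1 * k, 0.5)) for k in range(3)]
traj_cam = to_camera_frame(T_base_cam, traj_base)
```

In this toy setup the end-effector moves along the base-frame y-axis, which corresponds to motion along the camera-frame x-axis. Each transformed pose remains a valid rigid-body transform, which is the geometric structure that the abstract says observation-space representations lack.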