RO · CV · May 11

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

arXiv:2605.09989 · 92.0
Predicted impact top 9% in RO · last 90 days · Originality Incremental advance
AI Analysis

For robotic manipulation, this work addresses the lack of depth cues in monocular vision by introducing a scalable stereo-based framework that enhances geometric reasoning without explicit 3D reconstruction.

StereoPolicy improves robotic manipulation policies by using synchronized stereo image pairs instead of monocular inputs, achieving consistent performance gains over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and real-robot experiments.

Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
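The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the Stereo Transformer's internals, so this example assumes a single self-attention layer over the concatenated left and right token sequences, with a per-view embedding added so attention can distinguish the two cameras. The names `fuse_stereo_tokens`, `init_params`, and `view_embed` are hypothetical, and random arrays stand in for the outputs of a pretrained 2D encoder applied to each image independently.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(dim, seed=0):
    """Random weights for one attention layer (hypothetical, untrained)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return {
        # One embedding per view: index 0 = left camera, 1 = right camera.
        "view_embed": rng.normal(0.0, 0.02, size=(2, dim)),
        "w_q": rng.normal(0.0, scale, size=(dim, dim)),
        "w_k": rng.normal(0.0, scale, size=(dim, dim)),
        "w_v": rng.normal(0.0, scale, size=(dim, dim)),
    }


def fuse_stereo_tokens(left_tokens, right_tokens, params):
    """Fuse per-view tokens with single-head self-attention.

    left_tokens, right_tokens: (N, D) arrays, assumed to come from the
    same 2D encoder run separately on the left and right images.
    Returns the fused (2N, D) tokens and the (2N, 2N) attention matrix.
    """
    left = left_tokens + params["view_embed"][0]
    right = right_tokens + params["view_embed"][1]
    tokens = np.concatenate([left, right], axis=0)  # (2N, D)

    q = tokens @ params["w_q"]
    k = tokens @ params["w_k"]
    v = tokens @ params["w_v"]

    # Every token attends to tokens from both views, so left-to-right
    # correspondences can be picked up without explicit disparity maps.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = tokens + attn @ v  # residual connection
    return fused, attn


# Stand-in encoder outputs: 4 tokens per view, 8 feature dimensions.
rng = np.random.default_rng(1)
left_tokens = rng.normal(size=(4, 8))
right_tokens = rng.normal(size=(4, 8))

params = init_params(dim=8)
fused, attn = fuse_stereo_tokens(left_tokens, right_tokens, params)
cross_view_mass = attn[:4, 4:].sum(axis=1)  # left tokens attending to right
```

In this sketch, `cross_view_mass` is the share of attention each left-image token places on right-image tokens. A trained model could use that cross-view attention to encode correspondence implicitly, which is the kind of cue the abstract attributes to the Stereo Transformer. The fused tokens would then be passed to a downstream policy head, such as a diffusion-based or VLA policy.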

