RO · May 31

DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization

arXiv:2511.22445 76.52 citations h-index: 8
AI Analysis

For imitation learning in robotics, DIPOLE provides a robust visuomotor policy that generalizes to changes in lighting, texture, viewpoint, object placement, and object identity.

DIPOLE fuses vision and geometry via modality-wise dropout and cross-attention to improve visuomotor policy generalization under test-time variations. It outperforms six baselines by 39.1% on average across 18 simulated and 4 real-world tasks, with 41.5% gains under unseen visual distractors and 15.2% under randomized object placement.

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods tend to struggle once test-time conditions differ from the demonstrations, such as changes in lighting, texture, viewpoint, object placement, or object identity. To address this challenge, we propose DIffusion POlicy with compLementarity Encoders (DIPOLE), a visuomotor policy that learns to fuse complementary modalities through a training-time mechanism rather than a specialized fusion architecture. A modality-wise dropout masks one branch at each training step, encouraging each modality to remain individually informative. A lightweight cross-attention layer then exchanges complementary cues between the two. This design endows DIPOLE with five core strengths: stable high performance across diverse tasks, robustness to visual changes, spatial generalization at sub-centimeter precision, emergent capability beyond either modality, and zero-shot transfer to unseen objects. Across 18 simulated and 4 real-world tasks, DIPOLE outperforms six baselines by 39.1% on average, with gains of 41.5% under unseen visual distractors and 15.2% under randomized object placement.
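The two mechanisms named in the abstract, modality-wise dropout and a lightweight cross-attention exchange, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The token shapes, the choice to mask by zeroing, the uniform 50/50 choice of which branch to mask, the bidirectional residual attention, and the mean-pool-and-concatenate readout are all assumptions made for illustration. The function names and the `params` layout are hypothetical.

```python
import numpy as np


def modality_dropout(vis, geo, rng, training=True):
    """Mask exactly one branch per training step.

    Assumptions: masking means zeroing the tokens, and the masked
    branch is chosen uniformly at random. At test time both branches pass.
    """
    if not training:
        return vis, geo
    if rng.random() < 0.5:
        return np.zeros_like(vis), geo
    return vis, np.zeros_like(geo)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    queries: (Nq, d) tokens of one modality
    context: (Nc, d) tokens of the other modality
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return queries + softmax(scores) @ v


def fuse(vis, geo, params, rng, training=True):
    """Dropout, then a bidirectional exchange, then a pooled feature.

    params is a hypothetical dict holding one (w_q, w_k, w_v) triple
    per attention direction.
    """
    vis, geo = modality_dropout(vis, geo, rng, training)
    vis_out = cross_attention(vis, geo, *params["vis_from_geo"])
    geo_out = cross_attention(geo, vis, *params["geo_from_vis"])
    # Assumed readout: mean-pool each branch and concatenate.
    return np.concatenate([vis_out.mean(axis=0), geo_out.mean(axis=0)])


# Toy usage with random tokens standing in for encoder outputs.
rng = np.random.default_rng(0)
d = 8
vis_tokens = rng.normal(size=(4, d))  # e.g. image patch features
geo_tokens = rng.normal(size=(6, d))  # e.g. point-cloud features
params = {
    name: tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    for name in ("vis_from_geo", "geo_from_vis")
}
feature = fuse(vis_tokens, geo_tokens, params, rng, training=True)
print(feature.shape)  # (16,)
```

In this sketch the pooled feature would condition the diffusion policy's action head. Because one branch is zeroed at every training step, the downstream policy has to produce useful actions from either modality alone, which is the intuition the abstract gives for each modality staying individually informative.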
