CV, Aug 1, 2025

CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry

Jingchao Xie, Oussema Dhaouadi, Weirong Chen, Johannes Meier, Jacques Kaiser, Daniel Cremers

arXiv:2508.00568v1, 6.21 citations, h-index: 4, DAGM GCPR

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in autonomous navigation and robotics by making unsupervised visual odometry more robust in dynamic scenes without requiring ground-truth labels. The advance is incremental, since it builds on existing uncertainty modeling techniques.

The paper tackles the problem of dynamic objects causing erroneous pose estimates in unsupervised monocular visual odometry by introducing CoProU-VO, which combines target frame uncertainty with projected reference frame uncertainty to filter out unreliable regions. Experiments on the KITTI and nuScenes datasets show significant improvements over previous unsupervised monocular end-to-end two-frame-based methods, particularly in challenging highway scenes.

Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, and unsupervised approaches eliminate the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static-scene assumption, leading to erroneous pose estimates. We tackle this problem with uncertainty modeling, a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information and overlooks the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To this end, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
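
The abstract does not spell out the probabilistic formulation, so the following is only a rough sketch of what combining target and projected reference uncertainty could look like, not the authors' implementation. It assumes the two per-pixel errors are independent, so their variances add, and it uses a commonly seen uncertainty-weighted photometric loss of the form |r|/sigma + log(sigma). The function names, the nested-list image representation, and the loss form are all illustrative assumptions; the reference uncertainty is assumed to be already warped into the target view.

```python
import math


def combine_uncertainty(sigma_target, sigma_ref_projected):
    """Combine two per-pixel standard deviation maps.

    Assumption (not from the paper): the two error sources are independent,
    so the combined variance is the sum of the individual variances.
    Both inputs are equally sized 2D lists of positive floats.
    """
    return [
        [math.sqrt(st * st + sr * sr) for st, sr in zip(row_t, row_r)]
        for row_t, row_r in zip(sigma_target, sigma_ref_projected)
    ]


def uncertainty_weighted_loss(residual, sigma):
    """Mean over pixels of |r| / sigma + log(sigma).

    Assumption: a generic uncertainty-weighted photometric loss. A large
    sigma shrinks the residual term, and the log term penalizes the model
    for declaring every pixel uncertain.
    """
    total, count = 0.0, 0
    for row_r, row_s in zip(residual, sigma):
        for r, s in zip(row_r, row_s):
            total += abs(r) / s + math.log(s)
            count += 1
    return total / count
```

Under these assumptions, a pixel that is confident in the target frame but uncertain in the projected reference frame (for example, one covered by a moving vehicle in only one of the two frames) still receives a large combined sigma, so its photometric residual contributes less to the loss.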
