Self-Supervised Learning of Depth and Motion Under Photometric Inconsistency
This work addresses a key limitation in self-supervised depth and pose estimation for real-world dynamic scenes, offering an incremental but practical enhancement for applications like robotics and autonomous driving.
The paper tackles self-supervised learning of depth and motion in dynamic scenes, where the photometric consistency assumption is violated. It reports substantial improvements over state-of-the-art methods for depth and relative pose estimation on benchmark datasets, without added inference overhead.
Self-supervised learning of depth and pose from monocular sequences is an attractive solution: it uses the photometric consistency of nearby frames as its training signal and therefore depends much less on ground-truth data. In this paper, we address the case where the assumptions of previous self-supervised approaches are violated by the dynamic nature of real-world scenes. Rather than treating the resulting noise as uncertainty, our key idea is to incorporate more robust geometric quantities and to enforce internal consistency within the temporal image sequence. As demonstrated on commonly used benchmark datasets, the proposed method substantially improves on state-of-the-art methods for both depth and relative pose estimation from monocular image sequences, without adding inference overhead.
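
For readers unfamiliar with the photometric consistency signal, the following is a minimal sketch of the standard view-synthesis formulation that self-supervised methods of this kind build on. It is not the method proposed in this paper. The function name `photometric_error`, its arguments, the grayscale images, and the nearest-neighbour sampling are all assumptions made for illustration; practical systems typically use differentiable bilinear sampling inside a deep learning framework.

```python
import numpy as np

def photometric_error(target, source, depth, K, T_target_to_source):
    """Illustrative photometric error between a target frame and a source
    frame warped into the target view (hypothetical helper, not from the paper).

    target, source: (h, w) grayscale images
    depth: (h, w) depth map of the target frame
    K: (3, 3) pinhole intrinsics
    T_target_to_source: (4, 4) rigid transform from target to source camera
    Returns (mean absolute error over valid pixels, (h, w) validity mask).
    """
    h, w = depth.shape
    # Homogeneous pixel grid of the target frame, shape (3, h*w), row-major.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project each pixel to a 3D point in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.ravel()
    # Move the points into the source camera frame (assumes a static scene).
    R, t = T_target_to_source[:3, :3], T_target_to_source[:3, 3:4]
    proj = K @ (R @ points + t)
    # Perspective division, guarding against points behind the camera.
    z = proj[2]
    valid = z > 1e-6
    safe_z = np.where(valid, z, 1.0)
    us = np.round(proj[0] / safe_z).astype(int)
    vs = np.round(proj[1] / safe_z).astype(int)
    # Keep only projections that land inside the source image.
    valid &= (us >= 0) & (us < w) & (vs >= 0) & (vs < h)
    if not valid.any():
        return float("nan"), valid.reshape(h, w)
    # Nearest-neighbour lookup in the source image (a simplification).
    warped = source[vs[valid], us[valid]]
    error = np.abs(warped - target.ravel()[valid]).mean()
    return error, valid.reshape(h, w)
```

When the scene is static and the depth and pose are correct, this error is close to zero. Independently moving objects do not follow the rigid warp, so their pixels produce a large error even under perfect depth and pose, which is the kind of violated assumption the abstract refers to.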