Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates
This work offers computer vision researchers an incremental method that improves multi-view 3D vision by leveraging dense depth maps.
The paper integrates monocular depth estimates into structure-from-motion through marginalized bundle adjustment, which mitigates the high error variance of those estimates and achieves state-of-the-art or competitive results in camera pose estimation across varying scales.
Structure-from-Motion (SfM) is a fundamental 3D vision task that recovers camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA), which mitigates the error variance of MDE by leveraging the density of its depth maps. With MBA, we show that MDE depth maps are sufficiently accurate to yield state-of-the-art (SoTA) or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.
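The abstract does not spell out how MBA is formulated, so the following is only a minimal sketch of one way a RANSAC-style score can be marginalized over its inlier threshold. It is an assumption for illustration, not the paper's objective. The function names, the uniform threshold range `tau_max`, and the residual values are all hypothetical. The idea it illustrates is that averaging an inlier ratio over many thresholds removes the need to pick a single one, and that many dense residuals make this average a stable statistic.

```python
# Illustrative sketch only: NOT the paper's formulation.
# Assumption: "marginalizing" means averaging an inlier ratio over a range of
# thresholds, in the spirit of threshold-free RANSAC scoring.

def inlier_ratio(residuals, tau):
    """Fraction of residuals strictly below a single threshold tau."""
    return sum(r < tau for r in residuals) / len(residuals)

def marginalized_score(residuals, tau_max):
    """Inlier ratio averaged over thresholds uniform on [0, tau_max].

    For one non-negative residual r, the average of 1[r < tau] over tau in
    [0, tau_max] equals max(0, 1 - r / tau_max), giving a closed form.
    """
    return sum(max(0.0, 1.0 - r / tau_max) for r in residuals) / len(residuals)

def marginalized_score_numeric(residuals, tau_max, steps=10000):
    """Midpoint-rule approximation, used only to check the closed form."""
    width = tau_max / steps
    taus = ((i + 0.5) * width for i in range(steps))
    return sum(inlier_ratio(residuals, t) for t in taus) / steps

# Hypothetical non-negative residuals, e.g. per-pixel depth disagreement
# between two views; the last value plays the role of a gross outlier.
residuals = [0.1, 0.2, 0.4, 3.0]
closed = marginalized_score(residuals, tau_max=1.0)
numeric = marginalized_score_numeric(residuals, tau_max=1.0)
print(closed, numeric)
```

In this toy score the outlier at 3.0 contributes zero regardless of its magnitude, while small residuals contribute in proportion to how far they fall below `tau_max`. A robust objective of this general kind could in principle be maximized over camera poses, though whether MBA does so in this form is not stated in the abstract.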