DeepRelativeFusion: Dense Monocular SLAM using Single-Image Relative Depth Prediction
This work addresses dense 3D scene reconstruction from monocular video, which is important for robotics and AR/VR applications. Its contribution is incremental: two improvements to the existing DeepFusion energy minimization framework.
The paper tackles dense monocular SLAM by proposing DeepRelativeFusion, which uses relative depth prediction and adaptive filtering to densify semi-dense depth maps and refine camera poses. It reports a large improvement in dense reconstruction accuracy over state-of-the-art dense SLAM systems.
In this paper, we propose a dense monocular SLAM system, named DeepRelativeFusion, that is capable of recovering a globally consistent 3D structure. To this end, we use a visual SLAM algorithm to reliably recover the camera poses and semi-dense depth maps of the keyframes, and then use relative depth prediction to densify the semi-dense depth maps and refine the keyframe pose-graph. To improve the semi-dense depth maps, we propose an adaptive filtering scheme: a structure-preserving weighted-average smoothing filter that takes into account the pixel intensity and depth of the neighbouring pixels, and that yields a substantial gain in reconstruction accuracy after densification. To perform densification, we introduce two incremental improvements to the energy minimization framework proposed by DeepFusion: (1) an improved cost function, and (2) the use of single-image relative depth prediction. After densification, we update the keyframes with the optimized, two-view-consistent semi-dense and dense depth maps to improve pose-graph optimization, providing a feedback loop that refines the keyframe poses for accurate scene reconstruction. Our system outperforms state-of-the-art dense SLAM systems in dense reconstruction accuracy by a large margin.
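The adaptive filtering step described above can be illustrated with a minimal sketch. The abstract does not give the filter's exact form, so the following is an assumption: a joint-bilateral-style weighted average in which each neighbour's weight decays with its intensity difference and depth difference from the centre pixel. The function name `adaptive_filter`, the Gaussian weights, and the parameters `radius`, `sigma_i`, and `sigma_d` are hypothetical and are not taken from the paper. Pixels without a semi-dense depth estimate are marked `None` and left untouched.

```python
import math

def adaptive_filter(depth, intensity, radius=1, sigma_i=0.1, sigma_d=0.1):
    """Smooth a semi-dense depth map with a structure-preserving weighted average.

    depth:     2D list of floats, with None where no depth estimate exists.
    intensity: 2D list of floats with the same shape as depth.
    The Gaussian weighting is an assumed form, not the paper's exact filter.
    """
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            d0 = depth[y][x]
            if d0 is None:
                continue  # no semi-dense estimate here; leave it empty
            num = 0.0
            den = 0.0
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    d = depth[ny][nx]
                    if d is None:
                        continue  # only valid neighbours contribute
                    di = intensity[ny][nx] - intensity[y][x]
                    dd = d - d0
                    # Neighbours that differ in intensity or depth get a low weight,
                    # so averaging does not blur across object boundaries.
                    wgt = math.exp(-di * di / (2 * sigma_i ** 2)
                                   - dd * dd / (2 * sigma_d ** 2))
                    num += wgt * d
                    den += wgt
            out[y][x] = num / den  # den >= 1 because the centre pixel has weight 1
    return out

# A noisy depth value on a uniform surface is pulled towards its neighbours,
# while the missing pixel stays missing.
noisy = [[1.0, 1.1, 1.0, None]]
flat = [[0.5, 0.5, 0.5, 0.5]]
smoothed = adaptive_filter(noisy, flat)
```

With small `sigma_i` and `sigma_d`, neighbours across an intensity or depth discontinuity receive weights close to zero, which is what makes the smoothing structure-preserving in this sketch.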