CV | Mar 12, 2024

D4D: An RGBD diffusion model to boost monocular depth estimation

arXiv:2403.07516v16.57 citations | h-index: 14 | Has Code | IEEE Transactions on Circuits and Systems for Video Technology (Print)

Originality: Incremental advance

AI Analysis

This work addresses the data scarcity issue in computer vision for researchers and practitioners, offering an incremental improvement over synthetic data methods.

The paper tackles the problem of limited ground-truth RGBD data for monocular depth estimation by proposing a training pipeline that uses a diffusion model to generate realistic RGBD samples, resulting in RMSE reductions of up to 11.9% on indoor and outdoor datasets.
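The RMSE figures quoted here are relative reductions against a baseline. A minimal sketch of how such a number could be computed is given below; the function names, the convention that a depth of zero marks a missing ground-truth pixel, and the sample values are all illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np


def rmse(pred, target, valid_mask=None):
    """Root mean squared error over valid depth pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if valid_mask is None:
        # Assumption: zero depth marks a pixel without ground truth.
        valid_mask = target > 0
    diff = pred[valid_mask] - target[valid_mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def relative_reduction(baseline_rmse, improved_rmse):
    """Percentage by which improved_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - improved_rmse) / baseline_rmse


# Hypothetical values, chosen only to illustrate the arithmetic.
baseline = 0.500
improved = 0.4405
print(f"{relative_reduction(baseline, improved):.1f}% RMSE reduction")
```

Under this reading, a reduction of 11.9% means the new RMSE is 88.1% of the baseline RMSE, not that the error drops by 0.119 in absolute depth units.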

Ground-truth RGBD data are fundamental for a wide range of computer vision applications; however, such labeled samples are difficult to collect and time-consuming to produce. A common way to overcome this lack of data is to use graphics engines to produce synthetic proxies; however, those data often do not reflect real-world images, so the trained models perform poorly at inference. In this paper we propose a novel training pipeline that incorporates Diffusion4D (D4D), a customized 4-channel diffusion model able to generate realistic RGBD samples. We show the effectiveness of the developed solution in improving the performance of deep learning models on the monocular depth estimation task, where the correspondence between RGB image and depth map is crucial to achieving accurate measurements. Our supervised training pipeline, enriched by the generated samples, outperforms training on synthetic and on original data, achieving RMSE reductions of (8.2%, 11.9%) on the indoor NYU Depth v2 dataset and (8.1%, 6.1%) on the outdoor KITTI dataset.
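To make the "4-channel" idea concrete, the sketch below stacks an RGB image and its depth map into a single array, so that a generative model sees color and geometry jointly. This is only an illustration: the helper names `pack_rgbd` and `unpack_rgbd`, the `max_depth` parameter, and the scaling to [-1, 1] (a common convention for diffusion models) are assumptions, and the abstract does not say how D4D itself preprocesses its inputs.

```python
import numpy as np


def pack_rgbd(rgb, depth, max_depth):
    """Stack uint8 RGB (H, W, 3) and depth (H, W) into float32 (H, W, 4) in [-1, 1]."""
    rgb = np.asarray(rgb, dtype=np.float32) / 127.5 - 1.0
    # Assumption: depth is clipped to [0, max_depth] before rescaling.
    depth = np.clip(np.asarray(depth, dtype=np.float32), 0.0, max_depth)
    depth = depth / max_depth * 2.0 - 1.0
    return np.concatenate([rgb, depth[..., None]], axis=-1)


def unpack_rgbd(sample, max_depth):
    """Invert pack_rgbd: return a uint8 RGB image and a depth map in depth units."""
    rgb = np.clip((sample[..., :3] + 1.0) * 127.5, 0, 255).round().astype(np.uint8)
    depth = (sample[..., 3] + 1.0) / 2.0 * max_depth
    return rgb, depth


# Hypothetical 2x3 image with a constant depth of 5 and a 10-unit depth range.
rgb = np.zeros((2, 3, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 128]
depth = np.full((2, 3), 5.0)

sample = pack_rgbd(rgb, depth, max_depth=10.0)
print(sample.shape)
```

Keeping depth as a fourth channel of the same array means every generated pixel carries its own depth value, which matches the abstract's emphasis on RGB-depth correspondence.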
