Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation
This work addresses a limitation of existing depth estimation methods, which struggle in adverse conditions because diverse training data is scarce, and offers an incremental improvement for applications in autonomous driving and robotics.
The paper tackles robust monocular depth estimation in challenging conditions such as low light and rain. It uses Stable Diffusion to generate synthetic training data and integrates DINOv2 for semantic priors, reporting effective results on the nuScenes and Oxford RobotCar datasets.
Monocular depth estimation is a crucial task in computer vision. While existing methods show impressive results under standard conditions, they often perform unreliably in low-light or rainy scenes because their training data lacks diversity. This paper introduces an approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation. The approach addresses this limitation by using Stable Diffusion to generate synthetic images that mimic challenging conditions. A self-training mechanism is then introduced to improve the model's depth estimation in such environments. To make fuller use of the Stable Diffusion prior, the DINOv2 encoder is integrated into the depth model architecture, allowing the model to leverage rich semantic priors and improve its scene understanding. Furthermore, a teacher loss is introduced to guide the student models in acquiring meaningful knowledge independently, reducing their dependency on the teacher models. The approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, and the results demonstrate its efficacy. Source code and weights are available at: https://github.com/hitcslj/SSD.
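As a rough illustration of the self-training idea in the abstract, the following Python sketch shows a generic pseudo-label setup: a teacher predicts depth on a clean image, and a student is penalised for deviating from that prediction on a synthetic hard-condition version of the same scene. All names here (`l1_distillation_loss`, `self_training_step`, the flat-list depth representation) are hypothetical, and the L1 objective is an assumption chosen for simplicity. It is not the paper's teacher loss, whose exact form is not stated in the abstract.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: an image or depth map flattened to a list of floats.
Array = List[float]
DepthModel = Callable[[Sequence[float]], Array]


def l1_distillation_loss(student_depth: Sequence[float],
                         teacher_depth: Sequence[float]) -> float:
    """Mean absolute difference between student output and teacher pseudo-label."""
    if len(student_depth) != len(teacher_depth):
        raise ValueError("depth maps must have the same number of pixels")
    if not student_depth:
        raise ValueError("depth maps must be non-empty")
    total = sum(abs(s - t) for s, t in zip(student_depth, teacher_depth))
    return total / len(student_depth)


def self_training_step(teacher: DepthModel,
                       student: DepthModel,
                       clean_image: Sequence[float],
                       hard_image: Sequence[float]) -> float:
    """Compute one pseudo-label loss value (assumed setup, not the paper's code).

    The teacher sees the clean image; the student sees the synthetic
    hard-condition image (e.g. a night or rain translation of the same scene).
    """
    pseudo_label = teacher(clean_image)  # treated as a fixed target
    prediction = student(hard_image)
    return l1_distillation_loss(prediction, pseudo_label)


if __name__ == "__main__":
    # Toy stand-ins for real networks, for demonstration only.
    toy_teacher = lambda img: [2.0 * x for x in img]
    toy_student = lambda img: [2.0 * x + 0.5 for x in img]
    scene = [1.0, 2.0, 3.0]
    print(self_training_step(toy_teacher, toy_student, scene, scene))
```

In a real training loop the returned value would be backpropagated through the student only, with the teacher's output held fixed; the toy lambdas above merely stand in for the depth networks.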