Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering
This work addresses the challenge of obtaining accurate 3D reconstructions from depth predictions with unknown scale and shift, which benefits computer vision applications. The contribution is incremental, however, because it builds on existing mix-dataset training approaches.
The paper tackles 3D scene structure recovery from monocular depth estimation. It proposes a learning framework that predicts geometry-preserving depth without extra data or annotations, and it reports better generalization than state-of-the-art methods on benchmark datasets.
In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent work advocates mix-dataset training, which improves generalization across diverse scenes. However, mix-dataset training yields depth predictions only up to an unknown scale and shift, which hinders accurate 3D reconstruction. Existing solutions require extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions that promote depth estimation consistency across different views. Comprehensive experiments show that our framework generalizes better than existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our loss functions enable the model to recover domain-specific scale-and-shift coefficients using only unlabeled images.
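To make the scale-and-shift ambiguity concrete, the following minimal sketch illustrates two points from the abstract under stated assumptions. It is not the paper's method, which recovers the coefficients from unlabeled images through its rendering-based losses. Here the coefficients are instead fitted by ordinary least squares against a reference depth map, and depth is unprojected with a simple pinhole camera. The function names `align_scale_shift` and `unproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen only for illustration.

```python
import numpy as np


def align_scale_shift(pred, target):
    """Fit s, t that minimize ||s * pred + t - target||^2 (least squares).

    Illustrative only: this needs a reference depth map, whereas the paper
    recovers scale and shift without such labels.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target, rcond=None)
    return s, t


def unproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H, W, 3) points with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


# Synthetic example: a prediction that equals the true depth only after
# applying an unknown scale (3.0) and shift (0.5).
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(4, 5))
affine_pred = (true_depth - 0.5) / 3.0

scale, shift = align_scale_shift(affine_pred, true_depth)
points = unproject(scale * affine_pred + shift, fx=100.0, fy=100.0, cx=2.0, cy=1.5)
```

A pure scale error multiplies every unprojected point by the same factor, so the shape of the scene is preserved. A shift error changes each point by a different amount along its viewing ray, which distorts the reconstructed shape. This is why the shift in particular must be resolved before the depth map can yield a faithful 3D structure.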