Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model
This work addresses metric depth estimation on datasets not seen during training, a capability that matters for robotics and augmented reality, though it builds incrementally on existing diffusion methods.
The paper tackles zero-shot metric depth estimation with a generic diffusion model that jointly handles indoor and outdoor scenes and resolves depth-scale ambiguity through field-of-view conditioning. It reports a 25% reduction in relative error on zero-shot indoor datasets and a 33% reduction on zero-shot outdoor datasets over the current state of the art.
While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field-of-view (FOV) to handle scale ambiguity, and synthetically augmenting FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common, and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth), achieves a 25% reduction in relative error (REL) on zero-shot indoor and 33% reduction on zero-shot outdoor datasets over the current SOTA using only a small number of denoising steps. For an overview, see https://diffusion-vision.github.io/dmd
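
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The depth bounds D_MIN and D_MAX, the [-1, 1] target range, and all function names are assumptions chosen for illustration, since the abstract does not give them. The FOV formulas are the standard pinhole-camera relations, and REL is taken to be the usual mean absolute relative error.

```python
import math

# Assumed depth range in metres; illustrative values, not taken from the abstract.
D_MIN, D_MAX = 0.5, 80.0


def depth_to_log_target(depth_m: float) -> float:
    """Map metric depth to [-1, 1] on a log scale (hypothetical normalisation)."""
    d = min(max(depth_m, D_MIN), D_MAX)  # clip to the assumed range
    unit = math.log(d / D_MIN) / math.log(D_MAX / D_MIN)  # in [0, 1]
    return 2.0 * unit - 1.0


def log_target_to_depth(target: float) -> float:
    """Invert depth_to_log_target, returning depth in metres."""
    unit = (min(max(target, -1.0), 1.0) + 1.0) / 2.0
    return D_MIN * (D_MAX / D_MIN) ** unit


def vertical_fov(focal_px: float, height_px: float) -> float:
    """Vertical field of view in radians for a pinhole camera."""
    return 2.0 * math.atan(height_px / (2.0 * focal_px))


def fov_after_center_crop(fov: float, crop_fraction: float) -> float:
    """FOV after keeping a centred crop spanning crop_fraction of the image.

    A fraction above 1 models padding the image outwards, which widens the FOV.
    """
    return 2.0 * math.atan(crop_fraction * math.tan(fov / 2.0))


def relative_error(pred: list[float], gt: list[float]) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    if not gt or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
```

On a log scale, equal steps in the target correspond to equal ratios of depth, so a near indoor range and a far outdoor range each occupy a comparable share of the output interval. The crop relation shows why cropping or padding can simulate cameras with different intrinsics: scaling the image extent scales tan(FOV/2) by the same factor.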