Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
This work addresses domain gaps in monocular depth estimation. It is an incremental improvement that enhances an existing foundation model with new self-supervision techniques.
The paper tackles monocular depth estimation for real-world images that differ from the training data. It introduces a test-time self-supervision framework that refines depth predictions using generative priors, resulting in substantial gains in depth accuracy and realism over the baseline Depth Anything V2 model.
Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape-from-shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and update only the intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
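To make the re-lighting idea concrete, the following is a minimal sketch of how a predicted depth map can be turned into a shaded image using shape-from-shading cues. It is not the paper's renderer: it assumes a simple Lambertian surface, an orthographic camera, and a single directional light, and it treats the depth map as a height field. The function names `normals_from_depth` and `relight`, and the `albedo` and `ambient` parameters, are hypothetical and chosen for illustration only. The SDS guidance and the DA-V2 model itself are not shown.

```python
import numpy as np


def normals_from_depth(depth):
    """Estimate unit surface normals from a depth map via finite differences.

    Assumption: the depth map is treated as a height field z = f(x, y)
    seen by an orthographic camera.
    """
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # An unnormalized normal of the surface z = f(x, y) is (-dz/dx, -dz/dy, 1).
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)


def relight(depth, light_dir, albedo=1.0, ambient=0.1):
    """Render a Lambertian shading image of the depth map under one directional light.

    Assumption: uniform albedo and a fixed ambient term; both are illustrative.
    """
    light = np.asarray(light_dir, dtype=np.float64)
    light = light / np.linalg.norm(light)
    normals = normals_from_depth(depth)
    # Lambert's cosine law: brightness is proportional to max(0, n . l).
    diffuse = np.clip(normals @ light, 0.0, None)
    shading = albedo * (ambient + (1.0 - ambient) * diffuse)
    return np.clip(shading, 0.0, 1.0)


if __name__ == "__main__":
    # A ramp whose depth increases by one unit per pixel along x.
    ramp = np.tile(np.arange(5.0), (5, 1))
    image = relight(ramp, light_dir=(0.0, 0.0, 1.0))
    print(image.shape)
```

Because every step here is differentiable, the same computation written in an autodiff framework would let a loss on the shaded image propagate back to the depth prediction, which is the property a re-lighting based self-supervision signal relies on.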