CV | May 5, 2025

VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

arXiv:2505.02704v3 | 3.6 | Has Code

Originality: Incremental advance

AI Analysis

VGLD addresses a limitation of relative depth methods, namely that they lack the metric scale needed for downstream tasks, by providing a universal alignment module. The advance is incremental because it builds on existing language-based scale recovery approaches.

The paper tackles the lack of absolute scale in monocular depth estimation. Scale can be inferred from textual descriptions, but such descriptions are ambiguous. VGLD incorporates visual semantics to disambiguate them and reports robust and accurate metric predictions on benchmarks such as NYUv2 and KITTI.

Monocular depth estimation can be broadly categorized into two directions: relative depth estimation, which predicts normalized or inverse depth without absolute scale, and metric depth estimation, which aims to recover depth with real-world scale. While relative methods are flexible and data-efficient, their lack of metric scale limits their utility in downstream tasks. A promising solution is to infer absolute scale from textual descriptions. However, such language-based recovery is highly sensitive to natural language ambiguity, as the same image may be described differently across perspectives and styles. To address this, we introduce VGLD (Visually-Guided Linguistic Disambiguation), a framework that incorporates high-level visual semantics to resolve ambiguity in textual inputs. By jointly encoding both image and text, VGLD predicts a set of global linear transformation parameters that align relative depth maps with metric scale. This visually grounded disambiguation improves the stability and accuracy of scale estimation. We evaluate VGLD on representative models, including MiDaS and DepthAnything, using standard indoor (NYUv2) and outdoor (KITTI) benchmarks. Results show that VGLD significantly mitigates scale estimation bias caused by inconsistent or ambiguous language, achieving robust and accurate metric predictions. Moreover, when trained on multiple datasets, VGLD functions as a universal and lightweight alignment module, maintaining strong performance even in zero-shot settings. Code will be released upon acceptance.
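The global linear alignment described in the abstract can be illustrated with a minimal sketch. It assumes the transformation is a single scale and shift applied directly to the relative depth values; the abstract does not say whether VGLD applies it in depth or inverse-depth space, so that choice is an assumption. The function names `apply_global_alignment` and `fit_scale_shift` are hypothetical. In VGLD the two parameters are predicted from the image and text; here they are set by hand, and the closed-form least-squares fit is included only as a reference for what well-aligned parameters would look like when ground-truth depth is available.

```python
from typing import List, Tuple

DepthMap = List[List[float]]


def apply_global_alignment(relative_depth: DepthMap, scale: float, shift: float) -> DepthMap:
    """Map a relative depth map to metric depth with one global scale and shift.

    Assumption: the transform acts directly on depth values (not inverse depth).
    """
    return [[scale * d + shift for d in row] for row in relative_depth]


def fit_scale_shift(relative_depth: DepthMap, metric_depth: DepthMap) -> Tuple[float, float]:
    """Closed-form least-squares fit of (scale, shift) against ground-truth depth.

    This is a reference fit that needs ground truth; it is not the VGLD predictor.
    """
    xs = [d for row in relative_depth for d in row]
    ys = [d for row in metric_depth for d in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("depth maps must be non-empty and the same size")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")

    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


# Toy 2x2 relative depth map with values in [0, 1].
relative = [[0.0, 0.25], [0.5, 1.0]]

# Hand-picked parameters standing in for a model's prediction (illustrative values).
metric = apply_global_alignment(relative, scale=4.0, shift=1.0)

# Recover the parameters from the aligned map as a sanity check.
fitted_scale, fitted_shift = fit_scale_shift(relative, metric)
```

Because only two numbers are needed per image, an alignment step of this form stays independent of the relative depth backbone, which is consistent with the abstract's description of VGLD as a lightweight module evaluated on both MiDaS and DepthAnything.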
