CV, Jan 4

Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

arXiv:2601.01457v1
Originality: Incremental advance
AI Analysis

This paper addresses domain-shift sensitivity and scale ambiguity in monocular depth estimation for robotics and autonomous systems. It is an incremental improvement over existing calibration approaches.

The paper tackles the problem of recovering metric scale in monocular depth estimation by using language to predict an uncertainty-aware envelope of feasible calibration parameters and visual features to select image-specific calibrations. Experiments show improved in-domain accuracy on NYUv2 and KITTI and better zero-shot transfer to SUN-RGBD and DDAD compared to language-only baselines.

Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. On NYUv2 and KITTI, the method improves in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
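
The affine calibration in inverse depth and the closed-form least-squares oracle mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the validity mask, and the `eps` clamp are assumptions made for illustration. The `select_in_envelope` helper is hypothetical as well: the abstract does not specify how the envelope is parameterized or how the visual head selects within it, so a center plus half-width interval with a clipped gate is used here only as a stand-in.

```python
import numpy as np


def fit_affine_inverse_depth(rel_inv_depth, metric_depth, valid=None, eps=1e-6):
    """Least-squares fit of (scale, shift) such that
    scale * rel_inv_depth + shift approximates 1 / metric_depth.

    Assumption: rel_inv_depth is the backbone's relative inverse depth and
    metric_depth is ground-truth depth in metres; both have the same shape.
    """
    r = np.asarray(rel_inv_depth, dtype=np.float64).ravel()
    d = np.asarray(metric_depth, dtype=np.float64).ravel()

    # Keep only pixels with usable ground truth (positive depth).
    mask = d > eps
    if valid is not None:
        mask &= np.asarray(valid).astype(bool).ravel()

    r = r[mask]
    target = 1.0 / d[mask]  # regression target: metric inverse depth

    # Solve min over (scale, shift) of ||[r, 1] @ [scale, shift] - target||^2.
    design = np.stack([r, np.ones_like(r)], axis=1)
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(solution[0]), float(solution[1])


def apply_calibration(rel_inv_depth, scale, shift, eps=1e-6):
    """Map relative inverse depth to metric depth with the affine transform."""
    inv = scale * np.asarray(rel_inv_depth, dtype=np.float64) + shift
    # Clamp to avoid division by zero or negative inverse depth.
    return 1.0 / np.clip(inv, eps, None)


def select_in_envelope(center, half_width, gate):
    """Hypothetical selection rule: pick a value inside
    [center - half_width, center + half_width] using a gate in [-1, 1]."""
    return center + half_width * np.clip(gate, -1.0, 1.0)


# Synthetic check: build metric depth from a known scale and shift.
rng = np.random.default_rng(0)
rel = rng.uniform(0.1, 1.0, size=(4, 5))
true_scale, true_shift = 0.5, 0.05
depth = 1.0 / (true_scale * rel + true_shift)

scale, shift = fit_affine_inverse_depth(rel, depth)
recovered = apply_calibration(rel, scale, shift)
```

In this sketch the fitted `(scale, shift)` pair plays the role of the per-image oracle target used during training. At inference time no ground truth is available, so the calibration would have to come from the learned heads instead.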

