CV · Feb 10

RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

arXiv:2602.09532v1 · h-index: 16
Originality: Incremental advance
AI Analysis

The paper addresses poor depth estimation for underrepresented classes, a problem that matters for robotics and autonomous systems, and offers a strong domain-specific improvement.

The paper tackles accurate monocular metric depth estimation for underrepresented classes in complex scenes by proposing RAD, a retrieval-augmented framework that uses retrieved neighbors as geometric proxies. On underrepresented classes, RAD reduces relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
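
To make the percentages above concrete, here is a minimal sketch of how such a number could be computed. It assumes the metric is the standard absolute relative error (AbsRel, the mean of |prediction - ground truth| / ground truth over valid pixels) and that each percentage is a reduction relative to a baseline's error. The function names and the toy depth maps are illustrative only and are not taken from the paper.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumption: non-positive ground truth marks missing depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err

# Toy depth maps in metres (hypothetical values, not results from the paper).
gt = np.array([[2.0, 4.0], [0.0, 5.0]])             # 0.0 = no ground truth
baseline_pred = np.array([[2.4, 4.8], [1.0, 6.0]])  # 20% off at each valid pixel
improved_pred = np.array([[2.2, 4.4], [1.0, 5.5]])  # 10% off at each valid pixel

baseline_err = abs_rel(baseline_pred, gt)
improved_err = abs_rel(improved_pred, gt)
reduction = relative_reduction(baseline_err, improved_err)
print(f"AbsRel {baseline_err:.3f} -> {improved_err:.3f}, reduction {reduction:.1%}")
```

Under this reading, a 29.2% reduction means the new error is about 0.708 times the baseline error, not that the error drops by 29.2 percentage points.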

Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
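
The uncertainty-aware retrieval step described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the per-pixel embeddings, the confidence map, the threshold, the mean pooling, and the cosine-similarity search are all assumptions chosen to show the general idea of querying a bank of RGB-D samples using only the low-confidence regions of the input.

```python
import numpy as np

def retrieve_context(features, confidence, bank, k=2, conf_threshold=0.5):
    """Return indices of the k bank entries most similar to the low-confidence regions.

    features:   (H, W, D) per-pixel embeddings of the input image (hypothetical).
    confidence: (H, W) depth confidence in [0, 1] (hypothetical).
    bank:       (N, D) one embedding per stored RGB-D context sample (hypothetical).
    """
    low_conf = confidence < conf_threshold
    if not low_conf.any():
        # Nothing is uncertain, so no context is retrieved.
        return np.array([], dtype=int)
    # Pool the embeddings of uncertain pixels into a single query vector.
    query = features[low_conf].mean(axis=0)
    query = query / (np.linalg.norm(query) + 1e-12)
    bank_unit = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    sims = bank_unit @ query  # cosine similarity to each bank entry
    return np.argsort(-sims)[:k]

# Toy example: only the top-left pixel is uncertain, and its embedding is [1, 0, 0].
features = np.zeros((2, 2, 3))
features[0, 0] = [1.0, 0.0, 0.0]
features[1, 1] = [0.0, 1.0, 0.0]
confidence = np.array([[0.1, 0.9], [0.9, 0.9]])
bank = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0]])

idx = retrieve_context(features, confidence, bank, k=2)
print(idx)
```

In this toy setup the confident pixel with embedding [0, 1, 0] does not influence the query, so the bank entry that matches it is ranked last. The retrieved samples would then be passed, together with the input, to the dual-stream network and the matched cross-attention fusion that the abstract describes.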

