CVDec 9, 2025

Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

arXiv:2512.08700v1h-index: 4IEEE Robot Autom Lett

Originality Incremental advance

AI Analysis

This work addresses efficiency and scale consistency issues in monocular depth estimation for autonomous driving applications, representing an incremental improvement over existing methods.

The paper tackles the challenges of high computational cost and difficulty in estimating metric-scale depth in Full Surround Monocular Depth Estimation by proposing a novel knowledge distillation strategy that transfers knowledge from a foundation model to a lightweight network, achieving a favorable trade-off between performance and efficiency while meeting real-time requirements.

Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.

View on arXiv PDF

Similar