CV · Dec 2, 2024

AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation

arXiv:2412.01637v1 · 1 citation · h-index: 9
Originality: Incremental advance
AI Analysis

This paper addresses accurate metric depth estimation for computer vision applications. It offers a novel integration of audio cues into existing methods, though it builds incrementally on prior work using echoes.

The paper tackles metric depth estimation from monocular videos, which generalizes poorly across datasets and requires supervised data for scale-correct training. It uses audible echoes to improve depth prediction in both supervised and self-supervised settings, improving several state-of-the-art approaches and enabling scale correction.

Metric depth prediction from monocular videos generalizes poorly between datasets and requires supervised depth data for scale-correct training. Self-supervised training using multi-view reconstruction can benefit from large-scale natural videos but does not provide correct scale, which limits its usefulness. Recently, reflecting audible echoes off objects has been investigated for improved depth prediction and was shown to be sufficient to reconstruct objects at scale even without a visual signal. Because echoes travel at a fixed speed, they have the potential to resolve ambiguities in object scale and appearance. However, predicting depth end-to-end from sound and vision cannot benefit from unsupervised depth prediction approaches, which can process large-scale data without sound annotation. In this work we show how echoes can benefit depth prediction in two ways: when learning metric depth from supervised data, and as a supervisory signal for scale-correct self-supervised training. We show how we can improve the predictions of several state-of-the-art approaches and how the method can scale-correct a self-supervised depth approach.
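
The intuition that a fixed propagation speed pins down metric scale can be illustrated with a minimal sketch. This is not the paper's method, which the abstract does not detail. It assumes a speed of sound of 343 m/s (dry air at roughly 20 °C), a single clean echo with a known round-trip time, and median scaling as a stand-in for scale correction. The function names `echo_distance_m` and `rescale_to_metric` are hypothetical.

```python
from statistics import median

# Assumption: speed of sound in dry air at roughly 20 °C.
SPEED_OF_SOUND_M_S = 343.0


def echo_distance_m(round_trip_s: float) -> float:
    """Distance to a reflector given the round-trip time of its echo.

    The sound travels to the object and back, so the one-way
    distance is half of speed * time.
    """
    return SPEED_OF_SOUND_M_S * round_trip_s / 2.0


def rescale_to_metric(relative_depths: list[float], reference_m: float) -> list[float]:
    """Scale up-to-scale depths so their median equals a metric reference.

    Assumption: one global scale factor is enough, as in the median
    scaling often applied to self-supervised depth predictions.
    """
    scale = reference_m / median(relative_depths)
    return [d * scale for d in relative_depths]


# An echo returning after 20 ms implies a reflector about 3.43 m away.
reference = echo_distance_m(0.020)

# Hypothetical up-to-scale depths from a self-supervised network.
metric_depths = rescale_to_metric([0.5, 1.0, 1.5], reference)
```

After rescaling, the median of `metric_depths` matches the echo-derived distance, while the relative ordering and ratios of the predicted depths are unchanged.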
