CV · Sep 4, 2024

MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling

arXiv:2409.02846v1 · 2 citations · h-index: 7
Originality: Highly original
AI Analysis

This work addresses a key bottleneck in stereo matching for computer vision applications, offering a novel method to improve Transformer performance in a domain where CNNs have traditionally dominated.

The paper tackles the data scarcity issue in Transformer-based stereo matching by proposing MaDis-Stereo, which uses masked image modeling and knowledge distillation to enhance locality inductive bias, achieving state-of-the-art performance on benchmarks like ETH3D and KITTI 2015.

In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) when training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method performs both image reconstruction and depth prediction. While this strategy helps resolve the data scarcity issue, the dual task of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. The proposed method achieves state-of-the-art performance on several stereo matching benchmarks such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide attention distance measurements.
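
The two mechanisms named in the abstract, random token masking and an EMA-updated teacher, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_token_mask` and `ema_update`, the mask ratio of 0.25, the momentum of 0.9, and the use of a plain dictionary of floats in place of network weights are all assumptions made for illustration.

```python
import random


def random_token_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans, True where a token is masked.

    Hypothetical helper: the masking scheme used in the paper may differ.
    """
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def ema_update(teacher, student, momentum):
    """Move each teacher weight toward the matching student weight, in place.

    teacher <- momentum * teacher + (1 - momentum) * student
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value


# Mask a quarter of 16 image tokens (assumed ratio, for illustration only).
rng = random.Random(0)
mask = random_token_mask(num_tokens=16, mask_ratio=0.25, rng=rng)

# Toy "weights": a dictionary of floats stands in for real model parameters.
student = {"w": 1.0, "b": -2.0}
teacher = {"w": 0.0, "b": 0.0}
ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and averages the student over many steps, which is the usual reason an EMA teacher is expected to give more stable pseudo-labels than the student itself.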
