CV AI RO | Apr 9, 2024

Playing to Vision Foundation Model's Strengths in Stereo Matching

arXiv:2404.06261v1 | 18.636 citations | h-index: 11 | IEEE Trans Intell Veh

Originality: Highly original

AI Analysis

The paper addresses the problem of improving 3D perception for intelligent vehicles by proposing a new paradigm that improves both the accuracy and the generalizability of stereo matching.

This paper tackles the challenge of adapting vision foundation models (VFMs) to stereo matching, a geometric vision task where VFMs typically underperform. It introduces ViTAStereo, which ranks first on the KITTI Stereo 2012 dataset and outperforms the second-best method by approximately 7.9% in the percentage of error pixels at a 3-pixel tolerance.

Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those based on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, particularly for dense prediction tasks, their performance often falls short in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
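
To make the reported metric concrete, the following is a minimal sketch of how a percentage of error pixels at a 3-pixel tolerance, and a relative improvement between two such percentages, could be computed. It is not the authors' evaluation code or the official KITTI toolkit. The function names, the flat-list input format, and the convention that a ground-truth disparity of 0 marks a pixel without ground truth are all assumptions made for illustration, and the sample values are invented.

```python
def error_pixel_percentage(pred, gt, tolerance=3.0):
    """Percentage of valid pixels whose absolute disparity error exceeds `tolerance`.

    `pred` and `gt` are flat sequences of disparities in pixels.
    Assumption of this sketch: gt == 0 means "no ground truth here".
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Keep only pixels that have a ground-truth disparity.
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    # A pixel counts as an error pixel when it is off by more than the tolerance.
    bad = sum(1 for p, g in valid if abs(p - g) > tolerance)
    return 100.0 * bad / len(valid)


def relative_reduction(baseline_pct, new_pct):
    """Relative reduction of an error rate, in percent of the baseline rate."""
    return 100.0 * (baseline_pct - new_pct) / baseline_pct


# Illustrative values only (not taken from the paper).
pred = [10.0, 20.0, 35.0, 5.0]
gt = [10.5, 24.5, 34.0, 0.0]  # last pixel has no ground truth
rate = error_pixel_percentage(pred, gt)
```

Under this reading, a figure such as "approximately 7.9%" is a relative reduction of the error rate compared with the second-best method, not a difference in percentage points.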
