Rethinking Iterative Stereo Matching from Diffusion Bridge Model Perspective
This work addresses a specific bottleneck in iterative stereo matching, namely information loss during discrete disparity updates, and represents an incremental advance with strong performance gains.
The paper tackles the problem of information loss in iterative stereo matching by incorporating diffusion models into the optimization process, achieving over 7% improvement on the Scene Flow dataset and state-of-the-art results in just 8 iterations.
Recently, iteration-based stereo matching has shown great potential. However, these models optimize the disparity map using RNN variants, and this discrete optimization process causes information loss, which restricts the level of detail that can be expressed in the generated disparity map. To address this issue, we propose a novel training approach that incorporates diffusion models into the iterative optimization process. We design a Time-based Gated Recurrent Unit (T-GRU) to correlate temporal information with disparity outputs. Unlike standard recurrent units, it employs Agent Attention to generate more expressive features. We also design an attention-based context network to capture rich contextual information. Experiments on several public benchmarks show that our method achieves competitive stereo matching performance. Our model ranks first on the Scene Flow dataset, achieving an improvement of over 7% compared to competing methods, and requires only 8 iterations to achieve state-of-the-art results.
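For readers unfamiliar with Agent Attention, the following is a minimal NumPy sketch of the general mechanism: a small set of agent tokens first aggregates information from the keys and values, then broadcasts it back to the queries. It is an illustration under stated assumptions, not the paper's implementation. The function name `agent_attention`, the average-pooling of queries to form agents, the tensor sizes, and the omission of learned projections and bias terms are all choices made here. How the mechanism is wired into the T-GRU is not described in the abstract and is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(q, k, v, num_agents):
    """Hypothetical single-head agent attention.

    q, k, v: arrays of shape (N, d). Returns an array of shape (N, d).
    """
    n_tokens, dim = q.shape
    # Assumption: agent tokens are obtained by average-pooling the queries
    # over contiguous groups of tokens.
    groups = np.array_split(np.arange(n_tokens), num_agents)
    agents = np.stack([q[idx].mean(axis=0) for idx in groups])  # (n, d)
    scale = 1.0 / np.sqrt(dim)
    # Agent aggregation: each agent attends over all keys and values.
    agent_values = softmax(agents @ k.T * scale) @ v  # (n, d)
    # Agent broadcast: each query attends over the n agents only.
    return softmax(q @ agents.T * scale) @ agent_values  # (N, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
out = agent_attention(q, k, v, num_agents=8)
print(out.shape)  # (64, 16)
```

With N tokens and n agents, both attention maps have size N x n rather than N x N, so the cost grows linearly in N for fixed n. This is the usual motivation for using agents in place of full softmax attention.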