CVJun 26, 2025

ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching

arXiv:2506.21091v1h-index: 4
Originality Incremental advance
AI Analysis

This addresses the need for fast and accurate stereo matching in autonomous systems, representing an incremental improvement over existing methods.

The paper tackles the trade-off between accuracy and real-time performance in stereo matching by proposing ESMStereo, which uses an Enhanced Shuffle Mixer to upsample small-scale cost volumes, achieving 116 FPS on high-end GPUs with improved accuracy.

Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real-time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. These features are mixed by shuffling and layer splitting then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map at real-time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes