ES-Net: An Efficient Stereo Matching Network
This work addresses the need for fast and accurate stereo matching in domains such as autonomous driving, though its contribution is incremental: it builds on existing methods and adds efficiency improvements.
The paper tackles dense stereo matching for real-world applications such as autonomous driving by proposing ES-Net, an efficient network that avoids slow 3D convolutions. Compared with other low-cost methods, it reports state-of-the-art performance on the Scene Flow, DrivingStereo, and KITTI-2015 datasets.
Dense stereo matching with deep neural networks is of great interest to the research community. Existing stereo matching networks typically rely on slow and computationally expensive 3D convolutions to improve performance, which makes them ill-suited to real-world applications such as autonomous driving. In this paper, we propose the Efficient Stereo Network (ESNet), which achieves high performance and efficient inference at the same time. ESNet relies only on 2D convolutions and computes multi-scale cost volumes efficiently using a warping-based method, which improves performance in regions with fine details. In addition, we address matching ambiguity in occluded regions by proposing ESNet-M, a variant of ESNet that additionally estimates an occlusion mask without supervision. We further improve network performance with a new training scheme that includes dataset scheduling and unsupervised pre-training. Compared with other low-cost dense stereo depth estimation methods, our proposed approach achieves state-of-the-art performance on the Scene Flow [1], DrivingStereo [2], and KITTI-2015 [3] datasets. Our code will be made available.
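To make the idea of a warping-based cost volume concrete, the following is a minimal sketch and not the authors' implementation. It assumes integer disparities, nearest-pixel sampling, and a channel-wise dot product (correlation) as the matching cost. The names `warp_row` and `residual_cost_volume` are hypothetical, and a real network would use learned features and differentiable bilinear sampling. The right feature map is warped by a coarse disparity estimate, and costs are computed only for a few residual offsets around it, so the result is a small stack of 2D maps that 2D convolutions can process.

```python
def warp_row(row, disp_row):
    """Warp one right-view row toward the left view (hypothetical helper).

    Left pixel x samples right pixel x - d; out-of-range samples become 0.
    """
    width = len(row)
    out = []
    for x in range(width):
        src = x - disp_row[x]
        out.append(row[src] if 0 <= src < width else 0.0)
    return out


def residual_cost_volume(left, right, coarse_disp, radius=1):
    """Correlation costs for residual offsets around a coarse disparity.

    left, right: feature maps as nested lists indexed [channel][row][col].
    coarse_disp: integer disparities indexed [row][col] (assumed to come
                 from a coarser scale).
    Returns cost[k][row][col], where k corresponds to offset k - radius.
    """
    channels, height, width = len(left), len(left[0]), len(left[0][0])
    volume = []
    for r in range(-radius, radius + 1):
        plane = [[0.0] * width for _ in range(height)]
        for y in range(height):
            shifted = [d + r for d in coarse_disp[y]]
            for c in range(channels):
                warped = warp_row(right[c][y], shifted)
                for x in range(width):
                    plane[y][x] += left[c][y][x] * warped[x]
        volume.append(plane)
    return volume


# Toy data: one-hot features on a single row; the true disparity is 2.
WIDTH = 6
right = [[[1.0 if c == x else 0.0 for x in range(WIDTH)]] for c in range(WIDTH)]
left = [[[1.0 if c == x - 2 else 0.0 for x in range(WIDTH)]] for c in range(WIDTH)]
coarse = [[1] * WIDTH]  # coarse estimate that is off by one pixel

volume = residual_cost_volume(left, right, coarse, radius=1)

# Refine at column 3: pick the offset with the highest correlation.
costs_at_3 = [plane[0][3] for plane in volume]
best_offset = costs_at_3.index(max(costs_at_3)) - 1  # subtract radius
refined_disp = coarse[0][3] + best_offset
```

In this toy case only three candidate offsets are evaluated per pixel rather than a full disparity range, which illustrates why a warping-based volume can be cheaper than a dense 4D volume filtered with 3D convolutions.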