End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching
This work addresses stereo matching for computer vision applications, but it is incremental as it builds on existing CNN-based methods with specific improvements.
The paper tackles the problem of fusing contextual semantic information with fine-grained details in stereo matching by proposing a Multi-scale Features Network (MSFNet), which achieves state-of-the-art performance on the Scene Flow and KITTI 2015 benchmarks.
Deep neural networks have shown excellent performance on the stereo matching task. Recent CNN-based methods have shown that stereo matching can be formulated as a supervised learning task. However, less attention has been paid to the fusion of contextual semantic information and details. To tackle this problem, we propose a network for disparity estimation that exploits abundant contextual details and semantic information, called Multi-scale Features Network (MSFNet). First, we design a new structure that encodes rich semantic information and fine-grained details by fusing multi-scale features. We combine the advantages of element-wise addition and concatenation, which helps merge semantic information with details. Second, a guidance mechanism is introduced to guide the network to automatically focus more on unreliable regions. Third, we formulate the consistency check as an error map, obtained from the low-stage features with fine-grained details. Finally, we adopt the consistency check between the left feature and the synthetic left feature to refine the initial disparity. Experiments on the Scene Flow and KITTI 2015 benchmarks demonstrate that the proposed method achieves state-of-the-art performance.
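To make two of the ideas above concrete, the following is a minimal NumPy sketch, not the MSFNet implementation. It illustrates (1) one possible way to combine element-wise addition with concatenation when fusing two feature maps, and (2) a feature-level consistency check in which a synthetic left feature is obtained by sampling the right feature at x - d and compared against the real left feature to give an error map. The function names (`fuse_add_concat`, `warp_right_to_left`, `error_map`), the (C, H, W) layout, the nearest-neighbour sampling, and the mean absolute difference are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def fuse_add_concat(low, high):
    """Hypothetical fusion of two (C, H, W) feature maps.

    Stacks the element-wise sum together with both inputs along the
    channel axis, so the result has 3*C channels. This is only one
    way to combine addition and concatenation.
    """
    added = low + high
    return np.concatenate([added, low, high], axis=0)


def warp_right_to_left(right_feat, disparity):
    """Synthesize a left feature map from the right one.

    A left pixel at column x is assumed to match the right pixel at
    column x - d. Sampling is nearest-neighbour for simplicity.
    Returns the warped features and a boolean (H, W) validity mask.
    """
    _, h, w = right_feat.shape
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)  # (H, W)
    valid = (xs >= 0) & (xs < w)
    xs_clipped = np.clip(xs, 0, w - 1)
    rows = np.arange(h)[:, None]                                  # (H, 1)
    warped = right_feat[:, rows, xs_clipped]                      # (C, H, W)
    return warped * valid[None], valid


def error_map(left_feat, right_feat, disparity):
    """Mean absolute difference between left and synthetic left features.

    Pixels whose match falls outside the right image are set to zero
    here, which is a simplification chosen for this sketch.
    """
    warped, valid = warp_right_to_left(right_feat, disparity)
    err = np.abs(left_feat - warped).mean(axis=0)
    return np.where(valid, err, 0.0)


# Toy usage: the left features are the right features shifted by 2 pixels.
rng = np.random.default_rng(0)
right = rng.standard_normal((4, 5, 8))
left = np.zeros_like(right)
left[:, :, 2:] = right[:, :, :-2]

err_correct = error_map(left, right, np.full((5, 8), 2.0))
err_wrong = error_map(left, right, np.zeros((5, 8)))
fused = fuse_add_concat(left, right)
```

In this toy setup the error map vanishes when the disparity is correct and is non-zero when it is wrong, which is the signal a refinement stage could use to locate unreliable regions.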