UWStereo: A Large Synthetic Dataset for Underwater Stereo Matching
This work addresses the shortage of training data for underwater stereo matching, a task that matters for applications such as marine robotics and exploration. The contribution is incremental, however: it builds on existing stereo matching methods and adds a new dataset plus minor architectural improvements.
The authors tackle the lack of ground truth data for underwater stereo matching by introducing UWStereo, a large synthetic dataset with 29,568 stereo image pairs and dense disparity annotations. Benchmarking on it shows that current models struggle to generalize to new domains, which leads them to propose a training strategy that combines cross-domain masked image reconstruction with a cross-view attention enhancement module.
Despite recent advances in stereo matching, the extension to intricate underwater settings remains unexplored, primarily owing to: 1) the reduced visibility, low contrast, and other adverse effects of underwater images; 2) the difficulty in obtaining ground truth data for training deep learning models, i.e., simultaneously capturing an image and estimating its corresponding pixel-wise depth information in underwater environments. To enable further advances in underwater stereo matching, we introduce a large synthetic dataset called UWStereo. Our dataset includes 29,568 synthetic stereo image pairs with dense and accurate disparity annotations for the left view. We design four distinct underwater scenes filled with diverse objects such as corals, ships, and robots. We also introduce additional variations in camera model, lighting, and environmental effects. In comparison with existing underwater datasets, UWStereo is superior in terms of scale, variation, annotation, and photo-realistic image quality. To substantiate the efficacy of the UWStereo dataset, we conduct a comprehensive evaluation using nine state-of-the-art algorithms as benchmarks. The results indicate that current models still struggle to generalize to new domains. Hence, we design a new strategy that learns to reconstruct cross-domain masked images before stereo matching training and integrates a cross-view attention enhancement module that aggregates long-range content information to enhance the generalization ability.
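The abstract names two ingredients, masked image reconstruction and cross-view attention, without giving implementation details. The NumPy sketch below illustrates the general idea of each and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names `mask_random_patches` and `cross_view_attention`, the patch size and mask ratio, zero-filling of masked patches, the restriction of attention to pixels on the same image row of a rectified pair, and the residual addition.

```python
import numpy as np


def mask_random_patches(image, patch=16, ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper: a reconstruction network would be trained to
    recover the original image from the masked one.
    Returns the masked image and a boolean patch grid (True = masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    n_mask = int(round(ratio * gh * gw))

    # Choose which patches to hide, without replacement.
    flat = np.zeros(gh * gw, dtype=bool)
    flat[rng.choice(gh * gw, size=n_mask, replace=False)] = True
    grid = flat.reshape(gh, gw)

    # Expand the patch grid to pixel resolution; any border pixels left
    # over when H or W is not a multiple of `patch` stay unmasked.
    pixel_mask = np.zeros((h, w), dtype=bool)
    pixel_mask[: gh * patch, : gw * patch] = (
        grid.repeat(patch, axis=0).repeat(patch, axis=1)
    )

    masked = image.copy()
    masked[pixel_mask] = 0
    return masked, grid


def cross_view_attention(left, right):
    """Let each left-view feature attend to the right-view features on the same row.

    left, right: (H, W, C) feature maps from a rectified stereo pair.
    Returns the enhanced left features (H, W, C) and the attention
    weights (H, W, W).
    """
    c = left.shape[-1]
    # Scaled dot-product similarity between every left/right pixel pair per row.
    scores = np.einsum("hic,hjc->hij", left, right) / np.sqrt(c)
    # Numerically stable softmax over the right-view positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Aggregate right-view content and add it back to the left features.
    attended = np.einsum("hij,hjc->hic", weights, right)
    return left + attended, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 96, 3))
    masked, grid = mask_random_patches(image, patch=16, ratio=0.5, rng=rng)
    print("masked patches:", int(grid.sum()), "of", grid.size)

    left = rng.standard_normal((4, 10, 8))
    right = rng.standard_normal((4, 10, 8))
    enhanced, weights = cross_view_attention(left, right)
    print("enhanced features:", enhanced.shape, "attention:", weights.shape)
```

Restricting attention to a single row keeps the cost at O(H·W²) and matches the geometry of rectified stereo, where corresponding points share a scanline. A module meant to aggregate long-range content, as the abstract describes, may well attend more broadly, so the row restriction here is only a simplifying choice.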