CV · Aug 4, 2020

Learning Stereo from Single Images

arXiv:2008.01484v2 · 77 citations
AI Analysis

The approach reduces the need for costly ground truth data collection in stereo vision, making it possible to train stereo networks on large single-image collections like COCO.

The paper tackles the problem of training stereo matching networks without requiring ground truth depth or stereo pairs by generating plausible disparity maps from single images and using them to create synthetic stereo training data. The result is a significant reduction in human effort, with the approach outperforming networks trained on standard synthetic datasets on benchmarks like KITTI, ETH3D, and Middlebury.

Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.
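The step from a single image plus an estimated disparity map to a stereo pair can be illustrated with a simple forward warp. The sketch below is not the authors' pipeline, whose design details are not given in this abstract. It assumes a rectified setup in which a left-image pixel at column x appears at column x - d in the right view, resolves collisions by keeping the larger disparity, and leaves unfilled pixels as None. The function name `synthesize_right_view` and the plain nested-list image format are hypothetical choices made to keep the example dependency-free.

```python
def synthesize_right_view(left, disparity):
    """Forward-warp a left image into a hypothetical right view.

    left:      2D list of pixel values (rows of equal length)
    disparity: 2D list of non-negative disparities, same shape as left
    Returns (right, filled): the warped image, with None in unfilled
    pixels, and a same-shaped boolean mask of pixels that received a value.
    """
    height, width = len(left), len(left[0])
    right = [[None] * width for _ in range(height)]
    # Disparity of the pixel currently written at each target; -1 means empty.
    best = [[-1.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            d = disparity[y][x]
            # Assumed geometry: a left pixel at column x lands at x - d.
            target = x - int(round(d))
            if 0 <= target < width and d > best[y][target]:
                # On collisions, keep the larger disparity (the closer surface).
                right[y][target] = left[y][x]
                best[y][target] = d

    filled = [[value >= 0 for value in row] for row in best]
    return right, filled


# Toy example: one row, background at disparity 0, foreground at disparity 2.
right, filled = synthesize_right_view([[10, 20, 30, 40]], [[0, 0, 2, 2]])
print(right)   # [[30, 40, None, None]]
print(filled)  # [[True, True, False, False]]
```

In the toy row, the foreground pixels overwrite the background where they collide, and the two rightmost columns receive no source pixel. Holes of this kind are one reason a naive warp is not enough on its own and the disparity maps need a carefully designed pipeline around them.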

Code Implementations: 2 repos
