Unsupervised learning of depth and motion
This work addresses 3-D scene understanding for computer vision applications, though it appears incremental because it builds on biologically inspired units and existing learning frameworks.
The paper tackles the joint estimation of depth and motion from visual data by learning interrelations between images from multiple cameras or video frames. The reported results show that the approach achieves state-of-the-art performance in 3-D activity analysis and significantly outperforms existing hand-engineered 3-D motion features.
We present a model for the joint estimation of disparity and motion. The model is based on learning about the interrelations between images from multiple cameras, multiple frames in a video, or the combination of both. We show that learning depth and motion cues, as well as their combinations, from data is possible within a single type of architecture and a single type of learning algorithm, by using biologically inspired "complex cell"-like units, which encode correlations between the pixels across image pairs. Our experimental results show that the learning of depth and motion makes it possible to achieve state-of-the-art performance in 3-D activity analysis, and to outperform existing hand-engineered 3-D motion features by a very large margin.
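To make the idea of a "complex cell"-like unit that encodes correlations across an image pair more concrete, the following is a minimal sketch of a classic energy-model unit on 1-D signals. It is an illustration under stated assumptions, not the model described in the abstract: the filters here are hand-set Gabor functions rather than learned from data, and the names `gabor_pair` and `complex_cell_response`, the signal length, and the frequency are all hypothetical choices made for this example.

```python
import numpy as np


def gabor_pair(n, freq, phase_shift=0.0):
    """Return a quadrature pair (even, odd) of 1-D Gabor filters of length n.

    Hand-set filters are an assumption of this sketch; a learned model
    would obtain its filters from data instead.
    """
    sigma = n / 6.0
    t = np.arange(n) - (n - 1) / 2.0
    envelope = np.exp(-t**2 / (2 * sigma**2))
    even = envelope * np.cos(2 * np.pi * freq * t + phase_shift)
    odd = envelope * np.sin(2 * np.pi * freq * t + phase_shift)
    return even, odd


def complex_cell_response(x, y, freq, phase_shift):
    """Energy-model response to an image pair (x, y).

    Each "simple" unit sums filter responses from both inputs; the
    "complex" unit sums the squares of a quadrature pair of simple units.
    The response is largest when y is a copy of x shifted by the phase
    offset the unit is tuned to.
    """
    n = x.shape[0]
    fx_even, fx_odd = gabor_pair(n, freq, 0.0)
    fy_even, fy_odd = gabor_pair(n, freq, phase_shift)
    simple_even = fx_even @ x + fy_even @ y
    simple_odd = fx_odd @ x + fy_odd @ y
    return simple_even**2 + simple_odd**2


# Toy image pair: the second signal is a phase-shifted copy of the first.
t = np.arange(64)
freq = 4 / 64
true_shift = np.pi / 2
left = np.cos(2 * np.pi * freq * t + 0.3)
right = np.cos(2 * np.pi * freq * t + 0.3 + true_shift)

# A small bank of units, each tuned to a different phase shift.
candidates = np.linspace(-np.pi, np.pi, 8, endpoint=False)
responses = np.array(
    [complex_cell_response(left, right, freq, s) for s in candidates]
)
estimated_shift = candidates[np.argmax(responses)]
print(estimated_shift)
```

In this toy setting the unit whose tuning matches the true shift between the two signals responds most strongly, so reading out the most active unit gives a coarse estimate of the transformation. The same construction applies whether the pair comes from two cameras (disparity) or two video frames (motion), which is the sense in which one type of unit can serve both cues.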