Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
This paper addresses 3D scene understanding from 2D images, a problem relevant to computer vision and robotics, and contributes an incremental advance in how scenes are represented.
The paper recovers 3D structure from a single 2D image by factoring the scene into layout, object shape, and object pose. It predicts these factors with a convolutional neural network and demonstrates the merits of the representation through quantitative and qualitative comparisons on a large dataset of indoor scenes.
The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.
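To make the factored representation concrete, the following is a minimal sketch of how a scene might be stored as one layout plus a set of objects, each with a shape and a pose. The abstract does not specify any parameterization, so everything here is an assumption made for illustration: the names `Pose`, `SceneObject`, and `FactoredScene` are hypothetical, the layout is stored as a per-pixel depth grid, the shape is a list of points in a canonical object frame, and the pose is reduced to a scale, a rotation about the vertical axis, and a translation.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pose:
    """Hypothetical pose: per-axis scale, yaw about the y axis, translation."""
    scale: Vec3 = (1.0, 1.0, 1.0)
    yaw: float = 0.0  # radians; a full 3D rotation is simplified to yaw here
    translation: Vec3 = (0.0, 0.0, 0.0)

    def apply(self, p: Vec3) -> Vec3:
        # Scale in the canonical frame, rotate about y, then translate.
        x, y, z = p[0] * self.scale[0], p[1] * self.scale[1], p[2] * self.scale[2]
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        xr = c * x + s * z
        zr = -s * x + c * z
        t = self.translation
        return (xr + t[0], y + t[1], zr + t[2])


@dataclass
class SceneObject:
    """One object: a shape in its canonical frame plus the pose placing it."""
    shape: List[Vec3]  # assumption: points standing in for the object's shape
    pose: Pose


@dataclass
class FactoredScene:
    """A scene factored into a layout and a set of posed objects."""
    layout: List[List[float]]  # assumption: depth of the enclosing surfaces
    objects: List[SceneObject]

    def object_points_in_scene(self) -> List[List[Vec3]]:
        # Move every object's canonical-frame points into the scene frame.
        return [[obj.pose.apply(p) for p in obj.shape] for obj in self.objects]


if __name__ == "__main__":
    chair = SceneObject(
        shape=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        pose=Pose(scale=(2.0, 1.0, 1.0), translation=(0.0, 0.0, 3.0)),
    )
    scene = FactoredScene(layout=[[5.0, 5.0], [5.0, 5.0]], objects=[chair])
    print(scene.object_points_in_scene())
```

Keeping shape and pose separate means the same canonical shape can be placed anywhere in the scene by changing only the pose, while the layout is stored independently of the objects that sit in front of it.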