Outdoor inverse rendering from a single image using multiview self-supervision
This work addresses the ill-posed problem of inverse rendering from a single image, with applications in computer vision and graphics. It is incremental in that it builds on existing methods; its novelty lies in the supervision used for training.
The paper tackles scene-level inverse rendering: recovering shape, reflectance and lighting from a single uncontrolled image using a fully convolutional neural network. The method is evaluated on benchmarks for inverse rendering, normal map estimation and intrinsic image decomposition.
In this paper we show how to perform scene-level inverse rendering to recover shape, reflectance and lighting from a single, uncontrolled image using a fully convolutional neural network. The network takes an RGB image as input and regresses albedo, shadow and normal maps, from which we infer least squares optimal spherical harmonic lighting coefficients. Our network is trained using large uncontrolled multiview and timelapse image collections without ground truth. By incorporating a differentiable renderer, our network can learn from self-supervision. Since the problem is ill-posed, we introduce additional supervision. Our key insight is to perform offline multiview stereo (MVS) on images containing rich illumination variation. From the MVS pose and depth maps, we can cross project between overlapping views such that Siamese training can be used to ensure consistent estimation of photometric invariants. MVS depth also provides direct coarse supervision for normal map estimation. We believe this is the first attempt to use MVS supervision for learning inverse rendering. In addition, we learn a statistical natural illumination prior. We evaluate performance on inverse rendering, normal map estimation and intrinsic image decomposition benchmarks.
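The step of inferring least squares optimal spherical harmonic lighting from the regressed maps can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes a Lambertian image formation model, image = albedo × shadow × (SH basis · lighting), a second-order (9-term) real spherical harmonic basis with one common choice of ordering and normalisation, and linear-intensity pixels flattened to N rows. The function names `sh_basis` and `solve_lighting` are hypothetical.

```python
import numpy as np


def sh_basis(normals):
    """Evaluate 9 real second-order SH basis functions at unit normals.

    normals: (N, 3) array of unit vectors. Returns an (N, 9) array.
    The ordering and constants are an assumed convention.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack(
        [
            0.282095 * np.ones_like(x),
            0.488603 * y,
            0.488603 * z,
            0.488603 * x,
            1.092548 * x * y,
            1.092548 * y * z,
            0.315392 * (3.0 * z**2 - 1.0),
            1.092548 * x * z,
            0.546274 * (x**2 - y**2),
        ],
        axis=1,
    )


def solve_lighting(image, albedo, shadow, normals):
    """Least squares SH lighting under the assumed Lambertian model.

    image, albedo: (N, 3) RGB values; shadow: (N,); normals: (N, 3).
    Returns a (9, 3) array with one column of coefficients per colour channel.
    """
    basis = sh_basis(normals)
    lighting = np.zeros((9, 3))
    for c in range(3):
        # Each pixel gives one linear equation in the 9 unknown coefficients:
        # albedo * shadow * (basis @ l) = image.
        design = (albedo[:, c] * shadow)[:, None] * basis
        lighting[:, c] = np.linalg.lstsq(design, image[:, c], rcond=None)[0]
    return lighting
```

Because the solve is a linear least squares problem, it is differentiable with respect to the regressed maps, which is consistent with lighting being inferred from the network outputs rather than regressed directly.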