Neural Inverse Rendering of an Indoor Scene from a Single Image
This work addresses comprehensive scene understanding for applications such as augmented reality and robotics, though the contribution is incremental in that it builds on prior inverse rendering work.
The authors jointly estimate albedo, normals, and lighting for indoor scenes from a single image, and report state-of-the-art performance over methods that estimate one or more of these attributes.
Inverse rendering aims to estimate physical attributes of a scene, e.g., reflectance, geometry, and lighting, from image(s). Inverse rendering has been studied primarily for single objects or with methods that solve for only one of the scene attributes. We propose the first learning-based approach that jointly estimates albedo, normals, and lighting of an indoor scene from a single image. Our key contribution is the Residual Appearance Renderer (RAR), which can be trained to synthesize complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading), which would be neglected otherwise. This enables us to perform self-supervised learning on real data using a reconstruction loss, based on re-synthesizing the input image from the estimated components. We finetune with real data after pretraining with synthetic data. To this end, we use physically-based rendering to create a large-scale synthetic dataset, which is a significant improvement over prior datasets. Experimental results show that our approach outperforms state-of-the-art methods that estimate one or more scene attributes.
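The self-supervised reconstruction loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a Lambertian direct renderer lit by a few distant directional lights, an L1 photometric loss, and a residual term supplied as a plain array, where the paper would use the output of the trained RAR network. The names `direct_render` and `reconstruction_loss`, and the lighting parameterization, are hypothetical.

```python
import numpy as np

def direct_render(albedo, normals, light_dirs, light_rgb):
    """Lambertian image under distant directional lights (simplifying assumption).

    albedo:     (H, W, 3) diffuse reflectance
    normals:    (H, W, 3) unit surface normals
    light_dirs: (K, 3) unit directions pointing toward each light
    light_rgb:  (K, 3) RGB intensity of each light
    """
    # Clamped cosine between each normal and each light direction: (H, W, K)
    cos = np.clip(normals @ light_dirs.T, 0.0, None)
    # Sum the per-light contributions into an RGB shading map: (H, W, 3)
    shading = cos @ light_rgb
    return albedo * shading

def reconstruction_loss(image, albedo, normals, light_dirs, light_rgb, residual):
    """Mean absolute error between the input image and its re-synthesis.

    `residual` stands in for the RAR output, i.e. the appearance effects
    (inter-reflection, cast shadows, near-field illumination) that the
    direct renderer cannot produce. The L1 form is an assumption.
    """
    recon = direct_render(albedo, normals, light_dirs, light_rgb) + residual
    return float(np.mean(np.abs(image - recon)))

# Toy example: a flat grey surface facing a single white light head-on.
H, W = 4, 4
albedo = np.full((H, W, 3), 0.5)
normals = np.zeros((H, W, 3))
normals[..., 2] = 1.0
light_dirs = np.array([[0.0, 0.0, 1.0]])
light_rgb = np.array([[1.0, 1.0, 1.0]])
image = np.full((H, W, 3), 0.7)  # observed image is brighter than direct shading

loss_without_residual = reconstruction_loss(
    image, albedo, normals, light_dirs, light_rgb, np.zeros((H, W, 3)))
loss_with_residual = reconstruction_loss(
    image, albedo, normals, light_dirs, light_rgb, np.full((H, W, 3), 0.2))
```

In this toy case the direct renderer alone leaves a gap between the re-synthesized and observed image, and the residual term closes it. Without such a term, a reconstruction loss on real images would push that unexplained appearance into the albedo, normal, or lighting estimates.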