Inverse Rendering Techniques for Physically Grounded Image Editing
This work addresses the challenge of teaching computers to make the same high-level scene observations that people do, with applications in robotics and computer graphics. It appears incremental, since it builds on existing inverse rendering methods.
The paper tackles the problem of estimating intrinsic scene properties, such as geometry, materials, and lighting, from a single image using inverse rendering techniques. These estimates enable physically grounded image editing, in which objects can be seamlessly added, removed, or relocated in seconds.
From a single picture of a scene, people can typically grasp the spatial layout immediately and even make good guesses at material properties and where the light illuminating the scene is coming from. For example, we can reliably tell which objects occlude others, what an object is made of and its rough shape, which regions are illuminated or in shadow, and so on. Surprisingly little is known about how we make these determinations; as a result, we are still not able to robustly "teach" computers to make the same high-level observations as people. This document presents algorithms for understanding intrinsic scene properties from single images. The goal of these inverse rendering techniques is to estimate the configurations of scene elements (geometry, materials, luminaires, camera parameters, etc.) using only information visible in an image. Such algorithms have applications in robotics and computer graphics. One such application is physically grounded image editing: photo editing made easier by leveraging knowledge of the physical space. These applications allow sophisticated editing operations to be performed in a matter of seconds, enabling seamless addition, removal, or relocation of objects in images.