FaDIV-Syn: Fast Depth-Independent View Synthesis using Soft Masks and Implicit Blending
FaDIV-Syn addresses the need for fast, robust view synthesis in dynamic scenes for robotic applications such as VR teleoperation, though the contribution is incremental in that it builds on plane sweep volume techniques.
The paper tackles the problem of slow and depth-dependent novel view synthesis for robotic applications by proposing a method that avoids explicit depth estimation, achieving real-time performance at 540p and outperforming state-of-the-art extrapolation methods on the RealEstate10k dataset.
Novel view synthesis is required in many robotic applications, such as VR teleoperation and scene reconstruction. Existing methods are often too slow for these contexts, cannot handle dynamic scenes, and are limited by their explicit depth estimation stage, where incorrect depth predictions can lead to large projection errors. Our proposed method runs in real time on live streaming data and avoids explicit depth estimation by efficiently warping input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is fed directly into our network, which first estimates soft PSV masks in a self-supervised manner and then directly produces the novel output view. This improves efficiency and performance on transparent, reflective, thin, and featureless scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks at 540p in real time and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. We thoroughly evaluate ablations, such as removing the Soft-Masking network and training from fewer examples, as well as generalization to higher resolutions and stronger depth discretization. Our implementation is available.
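The warping step described above can be illustrated with a minimal NumPy sketch of plane sweep volume construction. This is not the authors' implementation. It assumes pinhole cameras, fronto-parallel planes in the target frame, a pose convention X_src = R X_tgt + t, nearest-neighbor sampling, and planes spaced uniformly in inverse depth. The function names (`plane_homography`, `warp_to_target`, `inverse_depth_planes`, `build_psv`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def plane_homography(K_src, K_tgt, R, t, depth):
    """Homography mapping target pixels to source pixels for the plane z = depth.

    Assumed convention: X_src = R @ X_tgt + t, plane normal n = (0, 0, 1)
    expressed in the target camera frame.
    """
    n = np.array([0.0, 0.0, 1.0])
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_tgt)

def warp_to_target(src, H, out_shape):
    """Warp `src` into the target frame via H (nearest neighbor, zeros outside)."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N homogeneous
    proj = H @ pix
    z = proj[2]
    valid = z > 1e-8  # keep only points in front of the source camera
    safe_z = np.where(valid, z, 1.0)
    u = np.rint(proj[0] / safe_z)
    v = np.rint(proj[1] / safe_z)
    valid &= (u >= 0) & (u <= src.shape[1] - 1) & (v >= 0) & (v <= src.shape[0] - 1)
    out = np.zeros((h * w,) + src.shape[2:], dtype=src.dtype)
    out[valid] = src[v[valid].astype(int), u[valid].astype(int)]
    return out.reshape((h, w) + src.shape[2:])

def inverse_depth_planes(near, far, num):
    """Depths spaced uniformly in inverse depth (an assumed sampling scheme)."""
    return 1.0 / np.linspace(1.0 / near, 1.0 / far, num)

def build_psv(images, Ks, poses, K_tgt, depths, out_shape):
    """Stack every input image warped onto every assumed depth plane."""
    planes = []
    for d in depths:
        for img, K, (R, t) in zip(images, Ks, poses):
            H = plane_homography(K, K_tgt, R, t, d)
            planes.append(warp_to_target(img, H, out_shape))
    return np.stack(planes)  # shape: (len(depths) * len(images), H, W[, C])
```

In this sketch, scene content that truly lies at a given plane depth lines up across the warped input images on that plane, while content at other depths appears misaligned. A network consuming the stacked volume can exploit that cue without ever predicting an explicit depth map, which is the property the abstract relies on.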