S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields
This work addresses the inefficient supervision of neural fields for 3D scene representation. It offers a training paradigm that improves performance in tasks such as novel view synthesis and surface reconstruction, though it is incremental in that it builds on existing neural field methods.
Neural field methods such as NeRF rely on point-wise losses. The paper addresses this by introducing a Stochastic Structural SIMilarity (S3IM) loss that draws collective supervision from distant pixels. Reported improvements include a drop of more than 90% in test MSE for TensoRF and DVGO on novel view synthesis and a 198% F-score gain for NeuS on surface reconstruction.
Recently, Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation from only posed RGB images. NeRF and related neural field methods (e.g., neural surface representation) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research has failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and related neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of processing multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representation at almost no extra cost. The improvements in quality metrics can be particularly significant for relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks, and NeuS obtains a 198% F-score gain and a 64% Chamfer $L_{1}$ distance reduction over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes.
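To make the contrast with a point-wise loss concrete, the following is a minimal NumPy sketch of a set-wise structural loss in the spirit described above. It is not the authors' implementation: the abstract does not give the exact patch construction, window, or weighting, so the function names (`set_ssim`, `stochastic_ssim_loss`), the `group_size` and `repeats` parameters, and the choice to compute SSIM statistics over each random group as a single window are all assumptions made for illustration. The stabilizing constants are the usual SSIM defaults for values in [0, 1].

```python
import numpy as np

# Usual SSIM stabilizers for intensities in [0, 1] (K1 = 0.01, K2 = 0.03).
C1, C2 = 0.01 ** 2, 0.03 ** 2


def set_ssim(x, y):
    """SSIM statistics over axis 1 of (groups, pixels, channels) arrays."""
    mu_x, mu_y = x.mean(axis=1), y.mean(axis=1)
    var_x, var_y = x.var(axis=1), y.var(axis=1)
    cov = ((x - mu_x[:, None]) * (y - mu_y[:, None])).mean(axis=1)
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den  # shape: (groups, channels)


def stochastic_ssim_loss(pred, target, group_size=16, repeats=4, seed=0):
    """Hypothetical set-wise loss on a mini-batch of pixels.

    pred, target: (N, C) arrays of rendered and ground-truth pixel colors.
    Pixels are shuffled into random groups (assumption), so each group can
    mix pixels that are far apart in the image.
    """
    rng = np.random.default_rng(seed)
    n = pred.shape[0]
    usable = (n // group_size) * group_size
    if usable == 0:
        raise ValueError("batch is smaller than group_size")
    scores = []
    for _ in range(repeats):
        # A fresh random grouping each repeat; leftover pixels are dropped.
        idx = rng.permutation(n)[:usable].reshape(-1, group_size)
        scores.append(set_ssim(pred[idx], target[idx]).mean())
    # Perfect agreement gives SSIM = 1, hence a loss of 0.
    return 1.0 - float(np.mean(scores))
```

Unlike a per-pixel MSE, the value for each group depends on the joint statistics (means, variances, covariance) of all its pixels, which is one way to read "processes multiple data points as a whole set". In training, a term like this would presumably be combined with the usual point-wise loss; the weighting between the two is not specified in this abstract.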