ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs
This work addresses the challenge of costly training and inference for dynamic scene representation in media production, though it appears incremental because it builds on existing NeRF methods.
The paper tackles the problem of generating temporal Neural Radiance Fields (NeRFs) for new scenes without retraining. It combines multi-view synthesis with scene flow-field estimation, both trained only on unrelated scenes, and reports a 15% quantitative improvement and better visual results.
Video editing techniques play a pivotal role in media production. Recent approaches have had great success at novel view synthesis of static scenes, but incorporating temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models overfit an MLP to describe the scene implicitly as a function of position; they achieve impressive results but are costly at both training and inference time. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. Using multi-view synthesis techniques and scene flow-field estimation, trained only on unrelated scenes, we can accurately reconstruct novel views. We show that existing state-of-the-art approaches from a range of fields cannot adequately solve this new task, and we demonstrate the efficacy of our solution. The resulting network yields a 15% quantitative improvement and produces significantly better visual results.
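To make concrete what "an MLP that describes the scene implicitly as a function of position" means, the following is a minimal NumPy sketch of a generic static NeRF-style field and its volume rendering along one ray. It is not the ZeST-NeRF architecture: the network size, the positional encoding, and all function names (`positional_encoding`, `init_mlp`, `query_field`, `render_ray`) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map (N, 3) positions to (N, 3 + 3 * 2 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)                # (F,)
    scaled = x[:, None, :] * freqs[None, :, None]      # (N, F, 3)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1)
    return np.concatenate([x, feats.reshape(x.shape[0], -1)], axis=1)

def init_mlp(in_dim, hidden=32, seed=0):
    """Random two-layer MLP (hypothetical sizes; a real NeRF is deeper)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, 4)), "b2": np.zeros(4),
    }

def query_field(params, positions):
    """Evaluate the implicit scene: position -> (RGB colour, density)."""
    h = np.maximum(positional_encoding(positions) @ params["w1"] + params["b1"], 0.0)
    out = h @ params["w2"] + params["b2"]
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))   # sigmoid keeps colours in (0, 1)
    sigma = np.maximum(out[:, 3], 0.0)        # ReLU keeps density non-negative
    return rgb, sigma

def render_ray(params, origin, direction, near=0.0, far=1.0, num_samples=16):
    """Composite colours along one ray with standard alpha compositing."""
    t = np.linspace(near, far, num_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]
    rgb, sigma = query_field(params, points)
    delta = np.full(num_samples, (far - near) / (num_samples - 1))  # uniform spacing
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

params = init_mlp(in_dim=3 + 3 * 2 * 4)
color, weights = render_ray(params, np.zeros(3), np.array([0.0, 0.0, 1.0]))
```

In a per-scene NeRF, the parameters of such a network are optimised against the images of a single scene, which is the costly overfitting step the abstract refers to. Rendering one pixel also requires many network queries per ray, which is why inference is expensive as well.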