Predicting 3D representations for Dynamic Scenes
This work addresses the challenge of 4D physical world modeling for applications in computer vision and robotics, representing an incremental advance over methods focused on future frame prediction.
The paper predicts explicit 3D representations of dynamic scenes from monocular video streams, achieving top results in dynamic radiance field prediction on NVIDIA dynamic scenes and generalizing well to unseen scenarios.
We present a novel framework for dynamic radiance field prediction from monocular video streams. Unlike previous methods, which primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer that aggregates features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA dynamic scenes, demonstrating its strong performance on 4D physical world modeling. In addition, our model shows superior generalizability to unseen scenarios. Notably, we find that capabilities for geometry and semantic learning emerge in our approach.
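To make the first design more concrete, the following is a minimal sketch of how a point in the ego (camera-centered) frame might be looked up in an unbounded triplane. It is an illustration under stated assumptions, not the paper's implementation: the contraction function is borrowed from the mip-NeRF 360 style of scene contraction, the three plane features are simply summed, and the names `contract`, `bilinear_sample`, and `query_triplane` are hypothetical.

```python
import numpy as np

def contract(points, eps=1e-9):
    """Map unbounded 3D points into a ball of radius 2.

    Assumption: a mip-NeRF 360 style contraction. Points with norm <= 1
    are kept as-is; farther points are squashed toward radius 2.
    """
    norm = np.linalg.norm(points, axis=-1, keepdims=True)
    safe = np.maximum(norm, eps)  # avoid division by zero at the origin
    squashed = (2.0 - 1.0 / safe) * (points / safe)
    return np.where(norm <= 1.0, points, squashed)

def bilinear_sample(plane, uv):
    """Bilinearly sample an (H, W, C) feature plane at uv in [-1, 1]^2."""
    h, w, _ = plane.shape
    x = (uv[:, 0] + 1.0) * 0.5 * (w - 1)  # continuous column index
    y = (uv[:, 1] + 1.0) * 0.5 * (h - 1)  # continuous row index
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bottom = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def query_triplane(planes, points):
    """Return per-point features from three axis-aligned planes.

    planes: dict with keys 'xy', 'xz', 'yz', each an (H, W, C) array.
    points: (N, 3) array of coordinates in the ego frame.
    Assumption: features from the three planes are aggregated by summation.
    """
    p = contract(points) / 2.0  # rescale the radius-2 ball into [-1, 1]^3
    f_xy = bilinear_sample(planes["xy"], p[:, [0, 1]])
    f_xz = bilinear_sample(planes["xz"], p[:, [0, 2]])
    f_yz = bilinear_sample(planes["yz"], p[:, [1, 2]])
    return f_xy + f_xz + f_yz

# Usage: 8-channel planes at 32x32 resolution, queried at near and far points.
rng = np.random.default_rng(0)
planes = {k: rng.normal(size=(32, 32, 8)) for k in ("xy", "xz", "yz")}
points = np.array([[0.0, 0.0, 0.0], [0.3, -0.2, 0.5], [100.0, -50.0, 3.0]])
features = query_triplane(planes, points)  # shape (3, 8)
```

In a full model, the returned features would typically be decoded into density and color for volume rendering, and the plane contents would be produced by the transformer rather than sampled at random as in this sketch.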