STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering
This work addresses the reconstruction and tracking of rigid objects in motion, a problem of interest to computer vision researchers, and enables novel view synthesis and animation without manual annotation.
This paper introduces STaR, a self-supervised method for tracking and reconstructing rigid objects in motion from multi-view RGB videos. It decomposes dynamic scenes into two neural radiance fields and a set of rigid poses, enabling photorealistic novel view rendering across spatial and temporal axes and animation of unseen object motion.
We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation. Recent work has shown that neural networks are surprisingly effective at compressing many views of a scene into a learned function that maps from a viewing ray to an observed radiance value via volume rendering. Unfortunately, these methods lose all their predictive power once any object in the scene has moved. In this work, we explicitly model rigid motion of objects in the context of neural representations of radiance fields. We show that, without any additional human-specified supervision, we can reconstruct a dynamic scene with a single rigid object in motion by simultaneously decomposing it into its two constituent parts and encoding each with its own neural representation. We achieve this by jointly optimizing the parameters of two neural radiance fields and a set of rigid poses that align the two fields at each frame. On both synthetic and real-world datasets, we demonstrate that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes. Our factored representation furthermore enables animation of unseen object motion.
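To make the factored representation concrete, the following is a minimal NumPy sketch of how two radiance fields and one per-frame rigid pose could be combined along a single camera ray. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the compositing rule, so summing densities and blending colors by density is an assumption, as is the convention that the pose maps world points into the object's canonical frame. The names `rigid_transform`, `composite_render`, `empty_background`, and `red_sphere` are hypothetical, and the two toy fields are hand-written functions standing in for trained networks.

```python
import numpy as np

def rigid_transform(points, rotation, translation):
    """Apply a rigid pose (3x3 rotation, 3-vector translation) to (N, 3) points."""
    return points @ rotation.T + translation

def composite_render(origin, direction, t_vals, static_field, dynamic_field,
                     rotation, translation):
    """Volume-render one ray through a static field plus a posed dynamic field."""
    points = origin + t_vals[:, None] * direction            # (N, 3) world samples
    sigma_s, rgb_s = static_field(points)
    # Assumption: the pose maps world points into the object's canonical frame.
    sigma_d, rgb_d = dynamic_field(rigid_transform(points, rotation, translation))

    # Assumption: densities add, and colors are blended by density.
    sigma = sigma_s + sigma_d
    rgb = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) \
        / np.maximum(sigma, 1e-10)[:, None]

    # Standard volume rendering quadrature over the ray samples.
    deltas = np.append(np.diff(t_vals), t_vals[-1] - t_vals[-2])
    alpha = 1.0 - np.exp(-sigma * deltas)
    transmittance = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = alpha * transmittance
    return (weights[:, None] * rgb).sum(axis=0)

def empty_background(points):
    """Toy static field: no density anywhere."""
    n = len(points)
    return np.zeros(n), np.zeros((n, 3))

def red_sphere(points):
    """Toy dynamic field: a dense red sphere of radius 0.5 in its canonical frame."""
    inside = np.linalg.norm(points, axis=1) < 0.5
    sigma = np.where(inside, 50.0, 0.0)
    rgb = np.tile([1.0, 0.0, 0.0], (len(points), 1))
    return sigma, rgb

origin = np.array([0.0, 0.0, -3.0])
direction = np.array([0.0, 0.0, 1.0])
t_vals = np.linspace(0.0, 6.0, 128)
identity = np.eye(3)

# Frame A: identity pose, so the ray passes through the sphere.
hit = composite_render(origin, direction, t_vals, empty_background, red_sphere,
                       identity, np.zeros(3))
# Frame B: a different pose shifts the object out of the ray's path.
miss = composite_render(origin, direction, t_vals, empty_background, red_sphere,
                        identity, np.array([2.0, 0.0, 0.0]))
```

In this sketch the pose is an ordinary input to the renderer. In a learned setting, the per-frame poses would be optimized jointly with the parameters of both fields against the observed pixels, and passing a pose that never occurred in the video corresponds to animating unseen object motion.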