ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction
This work improves video prediction models for applications in robotics and simulation by enabling more robust state estimation, though it is incremental over ViPro.
The paper tackles video prediction by addressing a shortcut in prior work (ViPro) that prevented accurate state estimation from observations when previous states were noisy. It demonstrates unsupervised state inference without requiring ground truth initial states, and extends the Orbits dataset with a 3D variant for realism.
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models to handle complex dynamical settings; however, the resulting model, ViPro, assumed a given ground truth initial symbolic state. We show that this approach led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, leaving it unable to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being given a full ground truth state at the start. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.