VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks
This paper addresses unsupervised learning of object representations and dynamics in videos, with applications such as planning and reasoning, though the contribution appears incremental because it builds on existing object-centric methods.
The paper tackles unsupervised object-centric video decomposition and prediction by introducing VideoPCDNet, which uses frequency-domain phase correlation to parse videos into object components and model motion, resulting in improved performance over baselines on synthetic datasets.
Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite advancements, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency domain operations and lightweight learned modules, VideoPCDNet enables accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
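To make the central operation concrete, the following is a minimal sketch of classical phase correlation for estimating a translation between two images. It is not the authors' implementation: the function name `phase_correlation_shift`, the `eps` stabilizer, and the toy random image are assumptions made for illustration, and the sketch handles only integer circular shifts of a single image rather than learned object prototypes.

```python
import numpy as np


def phase_correlation_shift(reference, shifted, eps=1e-8):
    """Estimate the integer circular shift (dy, dx) that maps `reference` onto `shifted`.

    Hypothetical helper for illustration only; assumes both inputs are 2D arrays
    of the same shape and that `shifted` is a circularly translated `reference`.
    """
    f_ref = np.fft.fft2(reference)
    f_shift = np.fft.fft2(shifted)

    # Normalized cross-power spectrum: magnitudes are discarded so that only
    # the phase difference, which encodes the translation, remains.
    cross_power = f_shift * np.conj(f_ref)
    cross_power /= np.abs(cross_power) + eps

    # The inverse transform is (approximately) a delta peak at the shift.
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)

    # Peak indices past the midpoint correspond to negative shifts.
    return tuple(
        int(p) if p <= size // 2 else int(p) - size
        for p, size in zip(peak, correlation.shape)
    )


# Toy usage: shift a random image by a known offset and recover it.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
moved = np.roll(image, shift=(3, -5), axis=(0, 1))
estimated = phase_correlation_shift(image, moved)
print(estimated)
```

Because the whole computation consists of Fourier transforms and elementwise operations, it is differentiable almost everywhere apart from the final peak selection, which is one reason frequency-domain matching of this kind can be combined with learned modules.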