Unsupervised Discovery of Parts, Structure, and Dynamics
This paper addresses the challenge of unsupervised object understanding in computer vision, though the contribution appears incremental because it builds on existing disentanglement and dynamics-modeling approaches.
The paper tackles the problem of learning hierarchical object representations and dynamics from unlabeled videos, resulting in a model that effectively segments parts, builds structure, and predicts motion across multiple datasets.
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
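To make the role of a hierarchical structure more concrete, here is a minimal sketch of how a part hierarchy can turn per-part local motions into global motions: each part moves by its own local motion plus the motions of all its ancestors. This is an illustration under stated assumptions, not the PSD implementation. The function name `compose_global_motion`, the parent-pointer encoding of the hierarchy, and the 2D translation vectors are all hypothetical choices made for this example; the abstract does not specify how the structural descriptor is represented.

```python
from typing import Dict, Optional, Tuple

Vec = Tuple[float, float]


def compose_global_motion(
    local_motion: Dict[str, Vec],
    parent: Dict[str, Optional[str]],
) -> Dict[str, Vec]:
    """Sum each part's local motion with the local motions of its ancestors.

    Assumption (not from the paper): the hierarchy is a tree given as
    parent pointers, and motion is a single 2D translation per part.
    """
    global_motion: Dict[str, Vec] = {}
    for part in local_motion:
        dx, dy = 0.0, 0.0
        node: Optional[str] = part
        visited = set()
        # Walk from the part up to the root, accumulating motion.
        while node is not None:
            if node in visited:
                raise ValueError(f"cycle in hierarchy at {node!r}")
            visited.add(node)
            mx, my = local_motion[node]
            dx += mx
            dy += my
            node = parent.get(node)
        global_motion[part] = (dx, dy)
    return global_motion


# Hypothetical three-part object: the hand hangs off the arm,
# and the arm hangs off the torso.
local = {"torso": (1.0, 0.0), "arm": (0.0, 2.0), "hand": (0.5, 0.5)}
parents = {"torso": None, "arm": "torso", "hand": "arm"}
result = compose_global_motion(local, parents)
```

In this toy example the torso moves only by its own motion, while the hand inherits the motion of both the arm and the torso. A learned model would have to infer both the parts and the parent relations from video rather than receive them as input.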