Contrastive Learning of Structured World Models
This work addresses the problem of enabling AI systems to understand compositional structure in their environments, an incremental advance in unsupervised representation learning for robotics and simulation tasks.
The paper tackles the challenge of learning structured world models from raw sensory data. It introduces Contrastively-trained Structured World Models (C-SWMs), which combine contrastive learning with graph neural networks to discover objects and their relations without direct supervision. Evaluated on compositional multi-object environments, simple Atari games, and a multi-object physics simulation, C-SWMs outperform typical pixel-reconstruction models in highly structured environments.
A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.
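The two ingredients named in the abstract, a state embedding structured as a set of object representations and a graph neural network over those objects, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the loss, so the hinge-style contrastive objective, the squared Euclidean energy, and the additive (translation-style) transition are assumptions, and the names `transition`, `energy`, `contrastive_loss`, `params`, and `margin` are hypothetical. The encoder that would map pixels to object slots is omitted; the sketch starts from already-extracted slots.

```python
import numpy as np

def transition(z, action, params):
    """Toy one-round message passing over a fully connected object graph.

    z: (K, D) array of K object slots; action: (A,) action vector.
    Returns a predicted change in the object slots, shape (K, D).
    Linear/tanh layers stand in for the MLPs a real model would use (assumption).
    """
    K, D = z.shape
    z_i = np.repeat(z, K, axis=0)           # row i*K + j holds z[i]
    z_j = np.tile(z, (K, 1))                # row i*K + j holds z[j]
    # Edge function on every ordered pair of objects (i, j).
    edges = np.tanh(np.concatenate([z_i, z_j], axis=1) @ params["w_edge"])
    edges = edges.reshape(K, K, D)
    # Zero out self-edges, then sum the messages arriving at each object i.
    mask = 1.0 - np.eye(K)[:, :, None]
    agg = (edges * mask).sum(axis=1)        # (K, D)
    # Every object sees the same action embedding (assumption).
    act = np.broadcast_to(action @ params["w_act"], (K, D))
    # Node function: combine own state, aggregated messages, and action.
    return np.concatenate([z, agg, act], axis=1) @ params["w_node"]

def energy(z_a, z_b):
    """Squared Euclidean distance per object, averaged over objects."""
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

def contrastive_loss(z_t, action, z_next, z_neg, params, margin=1.0):
    """Hinge-style contrastive loss (assumed form).

    Pulls the predicted next state toward the true next state and pushes
    a negative (e.g. randomly sampled) state at least `margin` away.
    """
    pred = z_t + transition(z_t, action, params)
    positive = energy(pred, z_next)
    negative = energy(z_neg, z_next)
    return positive + max(0.0, margin - negative)

# Usage with random slots and weights (all values are illustrative).
rng = np.random.default_rng(0)
K, D, A = 3, 4, 2
params = {
    "w_edge": rng.normal(size=(2 * D, D)),
    "w_act": rng.normal(size=(A, D)),
    "w_node": rng.normal(size=(3 * D, D)),
}
z_t = rng.normal(size=(K, D))
z_next = rng.normal(size=(K, D))
z_neg = rng.normal(size=(K, D))
loss = contrastive_loss(z_t, np.array([1.0, 0.0]), z_next, z_neg, params)
```

Because the loss is computed entirely in the latent space, nothing here reconstructs pixels, which is the contrast with reconstruction-based models that the abstract draws.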