CV, LG, RO (Apr 9, 2021)

GATSBI: Generative Agent-centric Spatio-temporal Object Interaction

arXiv:2104.04275v1, 11 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for agents to understand complex spatio-temporal interactions in high-dimensional environments; its contribution is an incremental improvement in object-centric modeling.

The paper tackles the problem of transforming raw visual observations into structured latent representations for vision-based decision-making, and reports superior performance in scene decomposition and video prediction compared to state-of-the-art methods.

We present GATSBI, a generative model that can transform a sequence of raw observations into a structured latent representation that fully captures the spatio-temporal context of the agent's actions. In vision-based decision-making scenarios, an agent faces complex high-dimensional observations where multiple entities interact with each other. The agent requires a good scene representation of the visual observation that discerns essential components and consistently propagates along the time horizon. Our method, GATSBI, utilizes unsupervised object-centric scene representation learning to separate an active agent, static background, and passive objects. GATSBI then models the interactions reflecting the causal relationships among decomposed entities and predicts physically plausible future states. Our model generalizes to a variety of environments where different types of robots and objects dynamically interact with each other. We show GATSBI achieves superior performance on scene decomposition and video prediction compared to its state-of-the-art counterparts.
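As a rough illustration of what separating a frame into an agent, a background, and passive objects can mean in practice, the sketch below blends three per-entity layers with per-pixel softmax masks. This is a generic mixture-style composition common in object-centric models, not GATSBI's actual implementation; the function names, the grayscale grids, and the hand-picked mask logits are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_frame(layers, mask_logits):
    """Blend per-entity layers into one frame using per-pixel softmax masks.

    layers and mask_logits are dicts mapping an entity name to an
    H x W grid (list of lists) of floats. Both are hypothetical inputs
    that a learned model would normally produce.
    """
    names = list(layers)
    height = len(layers[names[0]])
    width = len(layers[names[0]][0])
    frame = [[0.0] * width for _ in range(height)]
    masks = {name: [[0.0] * width for _ in range(height)] for name in names}
    for y in range(height):
        for x in range(width):
            # Masks compete per pixel, so they always sum to one.
            weights = softmax([mask_logits[name][y][x] for name in names])
            for name, weight in zip(names, weights):
                masks[name][y][x] = weight
                frame[y][x] += weight * layers[name][y][x]
    return frame, masks


# Toy 2 x 2 scene: each entity has a constant intensity (assumed values).
layers = {
    "background": [[0.1, 0.1], [0.1, 0.1]],
    "agent": [[0.9, 0.9], [0.9, 0.9]],
    "object": [[0.5, 0.5], [0.5, 0.5]],
}
# Each entity claims one pixel; the last pixel is left undecided.
mask_logits = {
    "background": [[5.0, 0.0], [0.0, 0.0]],
    "agent": [[0.0, 5.0], [0.0, 0.0]],
    "object": [[0.0, 0.0], [5.0, 0.0]],
}

frame, masks = compose_frame(layers, mask_logits)
```

In this toy scene each pixel is dominated by whichever entity has the largest logit, while the undecided pixel becomes an even mix of all three layers. A learned model would infer both the layers and the masks from video rather than receive them by hand.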

Code Implementations: 1 repo
