CV · AI · Feb 13, 2025

Object-Centric Latent Action Learning

arXiv:2502.09680v2 · 7 citations · h-index: 12
AI Analysis

This paper addresses robust imitation learning in visually complex environments for embodied AI. The contribution is incremental: it builds on existing latent action policy optimization (LAPO) methods.

The paper tackles learning from unlabeled internet video data for embodied AI, which is bottlenecked by action-correlated visual distractors. It proposes an object-centric latent action learning framework that mitigates distractor effects by 50% in downstream tasks.

Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy-action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. This allows LAPO to focus on task-relevant interactions, resulting in more robust proxy-action labels, enabling better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluated our method in eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
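
The abstract does not spell out how the 50% figure is computed. One plausible reading, shown here only as an assumption, is the fraction of the distractor-induced performance drop that the method recovers. The function name `distractor_gap_recovered` and the numbers below are hypothetical and are not taken from the paper.

```python
def distractor_gap_recovered(clean: float,
                             distracted_baseline: float,
                             distracted_method: float) -> float:
    """Fraction of the distractor-induced performance drop that a method wins back.

    ASSUMPTION: this normalization is one possible interpretation of
    "mitigates the negative effects of distractors by 50%"; the paper's
    exact definition may differ.

    clean               -- score of the baseline without distractors
    distracted_baseline -- score of the baseline with distractors
    distracted_method   -- score of the proposed method with distractors
    """
    gap = clean - distracted_baseline
    if gap <= 0:
        raise ValueError("baseline shows no degradation under distractors")
    return (distracted_method - distracted_baseline) / gap


# Hypothetical average returns, chosen only to illustrate the arithmetic:
# the baseline drops from 800 to 400 under distractors, and the method reaches 600.
print(distractor_gap_recovered(clean=800.0,
                               distracted_baseline=400.0,
                               distracted_method=600.0))  # prints 0.5
```

Under this reading, a value of 0.5 means half of the performance lost to distractors is recovered. The same formula applies whether the score is average return (DCS) or success rate (DMW).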

