CV AI | Dec 14, 2024

Sample-efficient Unsupervised Policy Cloning from Ensemble Self-supervised Labeled Videos

arXiv:2412.10778v27.65 citations | h-index: 20 | ICRA

Originality: Highly original

AI Analysis

This paper addresses the challenge of sample-efficient, unsupervised policy learning in scenarios where traditional supervision is expensive or unavailable. It represents a novel approach rather than an incremental improvement.

The paper tackles the problem of learning policies from action-free videos without rewards or expert supervision, proposing the UPESV framework that achieves state-of-the-art performance in interaction-limited settings, outperforming five baselines on 12 out of 16 tasks.

Current advanced policy learning methodologies have demonstrated the ability to develop expert-level strategies when provided with enough information. However, their requirements, including task-specific rewards, action-labeled expert trajectories, and huge numbers of environmental interactions, can be expensive or even unavailable in many scenarios. In contrast, humans can efficiently acquire skills within a few trials and errors by imitating easily accessible internet videos, in the absence of any other supervision. In this paper, we try to let machines replicate this efficient watching-and-learning process through Unsupervised Policy from Ensemble Self-supervised labeled Videos (UPESV), a novel framework to efficiently learn policies from action-free videos without rewards or any other expert supervision. UPESV trains a video labeling model to infer the expert actions in expert videos through several organically combined self-supervised tasks. Each task plays its own role, and together they enable the model to make full use of both action-free videos and reward-free interactions for robust dynamics understanding and advanced action prediction. Simultaneously, UPESV clones a policy from the labeled expert videos, and that policy in turn collects environmental interactions for the self-supervised tasks. After a sample-efficient, unsupervised, and iterative training process, UPESV obtains an advanced policy based on a robust video labeling model. Extensive experiments in sixteen challenging procedurally generated environments demonstrate that the proposed UPESV achieves state-of-the-art interaction-limited policy learning performance (outperforming five current advanced baselines on 12/16 tasks) without exposure to any supervision other than videos.
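The iterative loop described in the abstract (label the expert video, clone a policy from the labels, use that policy to gather reward-free interactions, then refit the labeler) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the ensemble of self-supervised tasks, so a single count-based inverse-dynamics table stands in for the video labeling model. The 1-D chain environment, the epsilon-random exploration, and every function name below are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

N_STATES = 8
ACTIONS = [0, 1, 2]
_EFFECT = {0: -1, 1: 0, 2: +1}  # toy dynamics, hidden from the learner


def step(state, action):
    """Hypothetical 1-D chain environment: move and clip to the valid range."""
    return min(max(state + _EFFECT[action], 0), N_STATES - 1)


def collect(policy, rng, n_steps, epsilon=0.5):
    """Reward-free interaction: follow the policy with epsilon-random exploration."""
    transitions, s = [], 0
    for _ in range(n_steps):
        if rng.random() < epsilon or s not in policy:
            a = rng.choice(ACTIONS)
        else:
            a = policy[s]
        s_next = step(s, a)
        transitions.append((s, a, s_next))
        s = s_next
    return transitions


def fit_labeler(transitions):
    """Stand-in labeling model: most frequent action for each observed state change."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[s_next - s][a] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in counts.items()}


def label_video(video, labeler):
    """Infer an action for each consecutive frame pair of the action-free video."""
    return [(s, labeler[s_next - s])
            for s, s_next in zip(video, video[1:])
            if (s_next - s) in labeler]


def clone_policy(labeled):
    """Tabular behavior cloning: majority action per state."""
    votes = defaultdict(Counter)
    for s, a in labeled:
        votes[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def train(video, rounds=3, steps_per_round=30, seed=0):
    rng = random.Random(seed)
    policy, data = {}, []
    for _ in range(rounds):
        data += collect(policy, rng, steps_per_round)       # interact
        labeler = fit_labeler(data)                         # refit labeling model
        policy = clone_policy(label_video(video, labeler))  # relabel and clone
    return policy


# Action-free "expert video": states only, walking from 0 to the last state.
expert_video = list(range(N_STATES))
policy = train(expert_video)
```

The learner never sees rewards or expert actions here; it only sees the state sequence in `expert_video` and its own transitions. In this toy setting the loop recovers the "move right" action for every state in the video, but the sketch says nothing about how the paper's method behaves in the procedurally generated environments it evaluates.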
