LG · AI · Feb 25, 2021

Off-Policy Imitation Learning from Observations

arXiv:2102.13185v1 · 96 citations
Originality: Incremental advance
AI Analysis

This paper addresses sample inefficiency in Learning from Observations (LfO) for reinforcement learning applications. The advance is incremental because it builds on existing distribution matching approaches.

The paper tackles sample-efficient Learning from Observations (LfO), where expert actions are unavailable, by proposing an off-policy optimization method regularized with an inverse action model. Its results on locomotion tasks are comparable to the state of the art.

Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit through the reuse of incomplete resources. Compared to conventional imitation learning (IL), LfO is more challenging because of the lack of expert action guidance. In both conventional IL and LfO, distribution matching is at the heart of their foundation. Traditional distribution matching approaches are sample-costly because they depend on on-policy transitions for policy learning. Toward sample efficiency, some off-policy solutions have been proposed, which, however, either lack comprehensive theoretical justification or depend on the guidance of expert actions. In this work, we propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner. To further accelerate the learning procedure, we regulate the policy update with an inverse action model, which assists distribution matching from the perspective of mode-covering. Extensive empirical results on challenging locomotion tasks indicate that our approach is comparable with the state of the art in terms of both sample efficiency and asymptotic performance.
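To make the idea of an inverse action model concrete, here is a minimal sketch of a model that predicts the action connecting a state to its successor, trained on the agent's own off-policy transitions and then applied to action-free expert observations. It is an illustration only, not the paper's algorithm: the 1-D dynamics `s_next = s + DT * a`, the linear least-squares fit, and every name (`fit_inverse_action_model`, `infer_action`, `replay_buffer`, `expert_pairs`) are assumptions made for this example.

```python
import random

DT = 0.1  # assumed step size of the toy dynamics s_next = s + DT * a


def fit_inverse_action_model(transitions):
    """Fit a ~= w * (s_next - s) + b by ordinary least squares.

    `transitions` is a list of (s, a, s_next) tuples collected by any
    behaviour policy, i.e. off-policy data from a replay buffer.
    """
    xs = [s_next - s for s, _, s_next in transitions]
    ys = [a for _, a, _ in transitions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def infer_action(model, s, s_next):
    """Predict the action that would move the agent from s to s_next."""
    w, b = model
    return w * (s_next - s) + b


# Off-policy data: random states and actions under the toy dynamics.
rng = random.Random(0)
replay_buffer = []
for _ in range(200):
    s = rng.uniform(-1.0, 1.0)
    a = rng.uniform(-1.0, 1.0)
    replay_buffer.append((s, a, s + DT * a))

model = fit_inverse_action_model(replay_buffer)

# Expert demonstrations contain only (s, s_next) pairs, no actions.
expert_pairs = [(0.0, 0.05), (0.2, 0.1)]
inferred = [infer_action(model, s, s_next) for s, s_next in expert_pairs]
```

In this toy setting the fitted model recovers the actions that the expert observations imply. How the paper uses such a model to regulate the policy update is described in the paper itself and is not reproduced here.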

Code Implementations: 1 repo