LG · AI · Feb 28, 2022

LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation

arXiv:2202.13536v2 · 28 citations
Originality: Incremental advance
AI Analysis

The paper addresses imitation learning in real-world settings where expert actions are inaccessible and environment interaction is costly or risky. The advance is incremental, since it builds on existing distribution correction methods.

The paper tackles offline learning from observation (LfO), where an agent mimics expert behavior using only state demonstrations, without expert actions or environment interaction. It proposes LobsDICE, which optimizes over stationary distributions to minimize the divergence between the expert's and the agent's state-transition distributions. Experiments on offline LfO tasks show that LobsDICE outperforms strong baselines.

We consider the problem of learning from observation (LfO), in which the agent aims to mimic the expert's behavior from the state-only demonstrations by experts. We additionally assume that the agent cannot interact with the environment but has access to the action-labeled transition data collected by some agents with unknown qualities. This offline setting for LfO is appealing in many real-world scenarios where the ground-truth expert actions are inaccessible and the arbitrary environment interactions are costly or risky. In this paper, we present LobsDICE, an offline LfO algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions. Our algorithm solves a single convex minimization problem, which minimizes the divergence between the two state-transition distributions induced by the expert and the agent policy. Through an extensive set of offline LfO tasks, we show that LobsDICE outperforms strong baseline methods.
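To make the objective in the abstract concrete, here is a minimal tabular sketch of the quantity being compared: the state-transition distribution d(s, s') induced by a policy, and a divergence between the agent's and the expert's versions of it. This is not the LobsDICE algorithm itself. The toy MDP, the discount factor, the function names, and the choice of KL as the divergence are all assumptions made for illustration. The sketch also assumes known dynamics, which the offline setting does not provide.

```python
import numpy as np

def state_action_occupancy(T, pi, p0, gamma):
    """Discounted occupancy d(s, a) of policy pi in a tabular MDP.

    T: (S, A, S) transition probabilities, pi: (S, A) policy,
    p0: (S,) initial state distribution, gamma: discount in [0, 1).
    """
    S = T.shape[0]
    # State-to-state kernel under pi: P[s, s'] = sum_a pi(a|s) T(s'|s, a)
    P = np.einsum("sa,sat->st", pi, T)
    # Solve the flow equation d_s = (1 - gamma) p0 + gamma P^T d_s
    d_s = np.linalg.solve(np.eye(S) - gamma * P.T, (1 - gamma) * p0)
    return d_s[:, None] * pi

def state_transition_distribution(T, d_sa):
    """d(s, s') = sum_a d(s, a) T(s'|s, a); no action labels remain."""
    return np.einsum("sa,sat->st", d_sa, T)

def kl(p, q, eps=1e-12):
    """KL(p || q) over flattened distributions, with q clamped away from 0."""
    p, q = p.ravel(), np.maximum(q.ravel(), eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
p0 = np.array([1.0, 0.0])
gamma = 0.9

pi_expert = np.array([[0.1, 0.9], [0.1, 0.9]])  # mostly switches
pi_agent = np.array([[0.5, 0.5], [0.5, 0.5]])   # uniform

d_expert = state_transition_distribution(
    T, state_action_occupancy(T, pi_expert, p0, gamma))
d_agent = state_transition_distribution(
    T, state_action_occupancy(T, pi_agent, p0, gamma))

print("KL(agent || expert) over (s, s'):", kl(d_agent, d_expert))
```

Comparing distributions over (s, s') rather than (s, a) is what lets the expert data be state-only: the expert's d(s, s') can be estimated from observed state pairs alone, while the agent's actions enter only through the dynamics.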

Code Implementations: 2 repos
