Mimicking Better by Matching the Approximate Action Distribution
The work addresses the challenge of imitation learning when expert actions are unavailable, offering a more sample-efficient and stable approach for robotics and control tasks, though it appears incremental, since it builds on existing techniques such as inverse dynamics models.
The paper tackles imitation learning from observations, where expert actions are unavailable, by introducing MAAD, a sample-efficient on-policy algorithm. MAAD uses an inverse dynamics model to infer action distributions from the expert's state transitions and regularizes the policy to align with them, reaching expert performance with fewer interactions and outperforming state-of-the-art on-policy methods in MuJoCo environments.
In this paper, we introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. MAAD utilizes a surrogate reward signal, which can be derived from various sources such as adversarial games, trajectory matching objectives, or optimal transport criteria. To compensate for the unavailability of expert actions, we rely on an inverse dynamics model that infers a plausible action distribution given the expert's state-state transitions; we regularize the imitator's policy by aligning it to the inferred action distribution. MAAD leads to significantly improved sample efficiency and stability. We demonstrate its effectiveness in a number of MuJoCo environments, both in the OpenAI Gym and the DeepMind Control Suite. We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods. Remarkably, MAAD often stands out as the sole method capable of attaining expert performance levels, underscoring its simplicity and efficacy.
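The regularization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that both the inverse dynamics model and the policy output diagonal Gaussian action distributions, and that "aligning" means adding a KL divergence penalty to the policy loss. The abstract specifies none of the following: the choice of KL, its direction, the coefficient `coef`, and the function names `gaussian_kl` and `action_matching_loss` are all hypothetical.

```python
import numpy as np


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    var_p, var_q = std_p ** 2, std_q ** 2
    kl = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return kl.sum(axis=-1)


def action_matching_loss(policy_loss, idm_mu, idm_std, pi_mu, pi_std, coef=0.1):
    """Add an action-distribution matching penalty to an on-policy loss.

    idm_mu, idm_std: action distribution inferred by a (hypothetical) inverse
        dynamics model from expert transitions (s, s'), shape (batch, act_dim).
    pi_mu, pi_std: imitator policy's action distribution at the same states s.
    coef: regularization weight (assumed value, not taken from the paper).
    """
    # Assumed direction: KL(inferred || policy), averaged over the batch.
    penalty = gaussian_kl(idm_mu, idm_std, pi_mu, pi_std).mean()
    return policy_loss + coef * penalty


# Toy usage with random stand-ins for the two models' outputs.
rng = np.random.default_rng(0)
idm_mu = rng.normal(size=(4, 2))
idm_std = np.full((4, 2), 0.5)
pi_mu = rng.normal(size=(4, 2))
pi_std = np.full((4, 2), 1.0)
total = action_matching_loss(0.0, idm_mu, idm_std, pi_mu, pi_std)
```

In this sketch the penalty vanishes when the policy's distribution coincides with the inferred one, so the surrogate-reward objective is left unchanged wherever the imitator already matches the inferred expert behavior.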