LG · Jun 29, 2024

A Bayesian Solution To The Imitation Gap

Risto Vuorio, Mattie Fellows, Cong Lu, Clémence Grislain, Shimon Whiteson

arXiv:2407.00495v1 · 7.92 citations

Originality: Incremental advance

AI Analysis

The paper addresses a specific limitation of imitation learning in environments without a reward signal, namely observability mismatches between the expert and the agent. The advance appears incremental because the method builds on Bayesian inverse reinforcement learning.

The paper tackles the imitation gap in imitation learning: when the expert and the agent differ in what they can observe, naive imitation can fail. It proposes a Bayesian solution that infers a posterior over rewards from the demonstrations and then learns a Bayes-optimal policy. In the reported experiments, the agent explores when an imitation gap is present and otherwise behaves optimally.

In many real-world settings, an agent must learn to act in environments where no reward signal can be specified, but a set of expert demonstrations is available. Imitation learning (IL) is a popular framework for learning policies from such demonstrations. However, in some cases, differences in observability between the expert and the agent can give rise to an imitation gap such that the expert's policy is not optimal for the agent and a naive application of IL can fail catastrophically. In particular, if the expert observes the Markov state and the agent does not, then the expert will not demonstrate the information-gathering behavior needed by the agent but not the expert. In this paper, we propose a Bayesian solution to the Imitation Gap (BIG), first using the expert demonstrations, together with a prior specifying the cost of exploratory behavior that is not demonstrated, to infer a posterior over rewards with Bayesian inverse reinforcement learning (IRL). BIG then uses the reward posterior to learn a Bayes-optimal policy. Our experiments show that BIG, unlike IL, allows the agent to explore at test time when presented with an imitation gap, whilst still learning to behave optimally using expert demonstrations when no such gap exists.
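As a rough illustration of the two stages described in the abstract (a reward posterior from Bayesian IRL, then a policy that is optimal under that posterior), the sketch below works through a tiny two-door task in Python. It is not the authors' implementation: the task, the Boltzmann expert model, the reward hypotheses, and every name and constant are assumptions made for illustration only.

```python
import itertools
import math

# Hypothetical toy task: a goal sits behind door 0 or door 1 (hidden state s,
# uniform). Actions: 0 = open door 0, 1 = open door 1, 2 = peek (reveals s).
DOORS, PEEK = (0, 1), 2
BETA = 5.0  # assumed Boltzmann rationality of the expert

# Assumed reward hypotheses: (which door pays off relative to s, peek cost).
HYPOTHESES = list(itertools.product(("match", "mismatch"), (0.1, 0.4)))
PRIOR = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}

# The expert observes s, so it opens the right door and never peeks.
DEMOS = [(0, 0), (1, 1), (0, 0), (1, 1)]  # (state, expert action) pairs


def door_reward(kind, s, a):
    hit = (a == s) if kind == "match" else (a != s)
    return 1.0 if hit else 0.0


def expert_q(h, s, a):
    """Expert action values; peeking only costs the expert, who already sees s."""
    kind, cost = h
    if a == PEEK:
        return -cost + max(door_reward(kind, s, d) for d in DOORS)
    return door_reward(kind, s, a)


def expert_likelihood(h, s, a):
    """Boltzmann likelihood of the expert choosing action a in state s."""
    z = sum(math.exp(BETA * expert_q(h, s, b)) for b in (*DOORS, PEEK))
    return math.exp(BETA * expert_q(h, s, a)) / z


def reward_posterior(demos):
    """Stage 1: posterior over reward hypotheses given the demonstrations."""
    unnorm = {
        h: PRIOR[h] * math.prod(expert_likelihood(h, s, a) for s, a in demos)
        for h in HYPOTHESES
    }
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


def agent_values(posterior):
    """Stage 2: expected return of the agent's plans; the agent cannot see s."""
    p_match = sum(p for (kind, _), p in posterior.items() if kind == "match")
    expected_cost = sum(p * cost for (_, cost), p in posterior.items())
    # Opening a door blind pays off half the time under either hypothesis.
    # This is also the best a cloned policy can do, since no demo contains a peek.
    guess = 0.5
    # Peek, then open the door that is more likely to pay under the posterior.
    peek_then_open = max(p_match, 1.0 - p_match) - expected_cost
    return {"guess": guess, "peek_then_open": peek_then_open}


posterior = reward_posterior(DEMOS)
values = agent_values(posterior)
best_plan = max(values, key=values.get)
```

In this sketch the demonstrations pin down which door pays off, while the assumed peek costs determine whether information gathering is worthwhile. With the costs chosen here, peeking first has a higher expected return than guessing blind, even though the expert never peeks.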
