LG · AI · RO · Nov 17, 2020

Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization

arXiv:2011.08541v1 · 32 citations
AI Analysis

This work provides a more efficient method for identifying multiple plausible reward functions in Inverse Reinforcement Learning. This is useful for researchers and practitioners working on value alignment and robot learning from demonstration, where the true reward function is ambiguous.

This paper addresses the ill-posed nature of Inverse Reinforcement Learning (IRL) by proposing Bayesian Optimization-IRL (BO-IRL), a framework that efficiently explores the reward function space to identify multiple reward functions consistent with the expert demonstrations. BO-IRL uses Bayesian optimization with a novel kernel that projects the parameters of policy-invariant reward functions to a single point in a latent space and ensures that nearby latent points correspond to reward functions with similar likelihoods, which reduces the number of expensive exact policy optimizations required.

The problem of inverse reinforcement learning (IRL) is relevant to a variety of tasks including value alignment and robot learning from demonstration. Despite significant algorithmic contributions in recent years, IRL remains an ill-posed problem at its core: multiple reward functions are consistent with the observed behavior, and the actual reward function is not identifiable without prior knowledge or supplementary information. This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL), which identifies multiple solutions that are consistent with the expert demonstrations by efficiently exploring the reward function space. BO-IRL achieves this by utilizing Bayesian optimization along with our newly proposed kernel that (a) projects the parameters of policy-invariant reward functions to a single point in a latent space and (b) ensures that nearby points in the latent space correspond to reward functions yielding similar likelihoods. This projection allows the use of standard stationary kernels in the latent space to capture the correlations present across the reward function space. Empirical results on synthetic and real-world environments (model-free and model-based) show that BO-IRL discovers multiple reward functions while minimizing the number of expensive exact policy optimizations.
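
The idea of projecting reward parameters to a latent space before applying a stationary kernel can be illustrated with a small sketch. This is not the paper's kernel. It is a simplified stand-in that assumes a linear reward over hypothetical trajectory features and handles only one policy-invariant transformation, adding a constant to every step's reward in a fixed-horizon setting. The names `trajectory_returns`, `project`, and `latent_rbf` are invented for this example. Each reward parameter vector is mapped to the softmax weight of every expert trajectory against a set of equal-length comparison trajectories, and a standard RBF kernel is then applied in that latent space.

```python
import numpy as np

def trajectory_returns(theta, feats):
    """Return of each trajectory under a linear reward r(s, a) = theta . phi(s, a).

    feats has shape (n_trajectories, horizon, n_features).
    """
    return feats.sum(axis=1) @ theta

def project(theta, expert_feats, sampled_feats):
    """Hypothetical projection (an assumption, not the paper's definition).

    For each expert trajectory, compute its softmax weight against a set of
    comparison trajectories of the same length.
    """
    expert_ret = trajectory_returns(theta, expert_feats)    # (n_expert,)
    sampled_ret = trajectory_returns(theta, sampled_feats)  # (n_sampled,)
    # Row i: return of expert trajectory i, followed by all comparison returns.
    rows = np.hstack([expert_ret[:, None],
                      np.tile(sampled_ret, (len(expert_ret), 1))])
    rows = rows - rows.max(axis=1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(rows)
    return weights[:, 0] / weights.sum(axis=1)

def latent_rbf(theta_a, theta_b, expert_feats, sampled_feats, lengthscale=1.0):
    """Standard RBF kernel evaluated on the projected (latent) points."""
    za = project(theta_a, expert_feats, sampled_feats)
    zb = project(theta_b, expert_feats, sampled_feats)
    return float(np.exp(-0.5 * np.sum((za - zb) ** 2) / lengthscale ** 2))

# Synthetic data for illustration only; no real demonstrations are used.
rng = np.random.default_rng(0)
horizon, n_features = 5, 3

def random_trajectories(n):
    feats = rng.normal(size=(n, horizon, n_features))
    feats[:, :, -1] = 1.0  # last feature is a constant bias term
    return feats

expert_feats = random_trajectories(4)
sampled_feats = random_trajectories(20)

theta = np.array([1.0, -0.5, 0.0])
theta_shifted = np.array([1.0, -0.5, 3.0])  # same reward plus a constant 3 per step
theta_other = np.array([-1.0, 2.0, 0.0])    # a genuinely different reward

k_same = latent_rbf(theta, theta_shifted, expert_feats, sampled_feats)
k_diff = latent_rbf(theta, theta_other, expert_feats, sampled_feats)
print(k_same, k_diff)
```

Because every trajectory has the same horizon, the constant shift adds the same amount to every return and cancels in the softmax, so `theta` and `theta_shifted` land on the same latent point and the kernel treats them as identical. A Gaussian process built on such a kernel would not need separate evaluations for rewards that differ only by this transformation.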

