Efficient Imitation under Misspecification
This addresses the problem of efficient and safe imitation learning for robotics or AI systems with perceptual or morphological differences, though it is incremental as it builds on prior local-search methods.
The paper tackles imitation learning under misspecification, where the learner cannot fully replicate expert behavior, by proving that local-search inverse reinforcement learning algorithms avoid compounding errors under a structural condition and broadening the search to states reachable by learner policies, with experimental validation on various misspecification sources.
We consider the problem of imitation learning under misspecification: settings where the learner is fundamentally unable to replicate expert behavior everywhere. This is often true in practice due to differences in observation space and action space expressiveness (e.g. perceptual or morphological differences between robots and humans). Given the learner must make some mistakes in the misspecified setting, interaction with the environment is fundamentally required to figure out which mistakes are particularly costly and lead to compounding errors. However, given the computational cost and safety concerns inherent in interaction, we'd like to perform as little of it as possible while ensuring we've learned a strong policy. Accordingly, prior work has proposed a flavor of efficient inverse reinforcement learning algorithms that merely perform a computationally efficient local search procedure with strong guarantees in the realizable setting. We first prove that under a novel structural condition we term reward-agnostic policy completeness, these sorts of local-search based IRL algorithms are able to avoid compounding errors. We then consider the question of where we should perform local search in the first place, given the learner may not be able to "walk on a tightrope" as well as the expert in the misspecified setting. We prove that in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play. We then experimentally explore a variety of sources of misspecification and how offline data can be used to effectively broaden where we perform local search from.