LG · AI · OC · ML · Mar 20, 2019

Batch Policy Learning under Constraints

arXiv:1903.08738v1 · 393 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of learning safe and efficient policies from pre-collected data for real-world applications such as autonomous driving. It appears incremental, as it builds on existing batch reinforcement learning (RL) and off-policy evaluation (OPE) techniques.

The paper tackles batch policy learning under multiple constraints by proposing a meta-algorithm that combines batch reinforcement learning with online learning, along with a new off-policy evaluation (OPE) method. It reports strong empirical results in domains such as simulated car driving, and the OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. To certify constraint satisfaction, we propose a new and simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.
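The abstract does not spell out the algorithm, so the following is only a toy sketch of one common way to combine a policy-learning subroutine with an online learner: treat the constrained problem as a game over a Lagrangian, let one player best-respond with a policy, and let the other update the multipliers by projected gradient ascent. Everything here is an assumption made for illustration: the function names, the finite set of candidate policies (standing in for a batch RL call), the exactly known costs (standing in for OPE estimates), and the step size and multiplier cap.

```python
def lagrangian(cost, constraint_costs, thresholds, lam):
    """L(pi, lam) = C(pi) + sum_i lam_i * (G_i(pi) - tau_i)."""
    penalty = sum(l * (g - t) for l, g, t in zip(lam, constraint_costs, thresholds))
    return cost + penalty


def best_response(policies, thresholds, lam):
    """Hypothetical stand-in for a batch RL subroutine:
    pick the candidate policy that minimizes the Lagrangian for fixed lam."""
    return min(
        policies,
        key=lambda name: lagrangian(policies[name][0], policies[name][1], thresholds, lam),
    )


def solve_constrained(policies, thresholds, rounds=200, step=0.5, lam_max=10.0):
    """Alternate best responses with projected gradient ascent on lam.

    policies maps a name to (main_cost, [constraint_cost_1, ...]).
    Returns the empirical mixture over played policies and the final lam.
    """
    lam = [0.0] * len(thresholds)
    counts = {name: 0 for name in policies}
    for _ in range(rounds):
        name = best_response(policies, thresholds, lam)
        counts[name] += 1
        constraint_costs = policies[name][1]
        # Raise lam_i when constraint i is violated, lower it otherwise,
        # and project back onto [0, lam_max].
        lam = [
            min(lam_max, max(0.0, l + step * (g - t)))
            for l, g, t in zip(lam, constraint_costs, thresholds)
        ]
    mixture = {name: c / rounds for name, c in counts.items()}
    return mixture, lam


if __name__ == "__main__":
    # Made-up numbers: "fast" is cheap but violates the constraint,
    # "safe" satisfies it at a higher main cost.
    toy_policies = {"fast": (1.0, [0.8]), "safe": (2.0, [0.2])}
    mixture, lam = solve_constrained(toy_policies, thresholds=[0.5])
    print(mixture, lam)
```

In this toy setup neither candidate is a good answer alone, and the returned mixture randomizes between them so that the averaged constraint cost lands close to the threshold. A real instantiation would replace `best_response` with a batch RL procedure run on the logged data and would estimate the costs from that same data rather than assume them.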

Code Implementations: 2 repos
