AI | Dec 30, 2021

Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning

Tong Mu, Georgios Theocharous, David Arbour, Emma Brunskill

arXiv:2112.15221v16.16 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying reinforcement learning in complex, human-facing domains where slow learning can be detrimental, offering a practical solution for applications like healthcare and education.

The paper tackles slow learning and poor early performance in online reinforcement learning for human-facing applications. It introduces Constraint Sampling Reinforcement Learning (CSRL), which incorporates human expertise as policy constraints to accelerate learning. The results show that CSRL learns a good policy faster than baselines in four environments, three of which are simulators based on real data: recommendations, educational activity sequencing, and HIV treatment sequencing.

Online reinforcement learning (RL) algorithms are often difficult to deploy in complex human-facing applications as they may learn slowly and have poor early performance. To address this, we introduce a practical algorithm for incorporating human insight to speed learning. Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy. It takes in multiple potential policy constraints to maintain robustness to misspecification of individual constraints while leveraging helpful ones to learn quickly. Given a base RL learning algorithm (e.g., UCRL, DQN, Rainbow), we propose an upper confidence with elimination scheme that leverages the relationship between the constraints, and their observed performance, to adaptively switch among them. We instantiate our algorithm with DQN-type algorithms and UCRL as base algorithms, and evaluate our algorithm in four environments, including three simulators based on real data: recommendations, educational activity sequencing, and HIV treatment sequencing. In all cases, CSRL learns a good policy faster than baselines.
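To make the idea of adaptively switching among candidate constraints more concrete, here is a minimal Python sketch. It is not the CSRL algorithm from the paper: it swaps in a generic UCB1-style selection rule with Hoeffding-interval elimination, treats each constraint as an independent bandit arm, and ignores both the base RL learner and the relationship between constraints that the abstract mentions. The names `ConstraintSelector` and `simulate`, the `delta` parameter, and the toy return values are all hypothetical and chosen only for illustration.

```python
import math
import random


class ConstraintSelector:
    """UCB-style choice among candidate policy constraints, with elimination.

    Hypothetical illustration only; NOT the scheme from the CSRL paper.
    Assumes episode returns lie in [0, reward_range].
    """

    def __init__(self, n_constraints, reward_range=1.0, delta=0.05):
        self.active = set(range(n_constraints))  # constraints still in play
        self.counts = [0] * n_constraints        # episodes run under each
        self.means = [0.0] * n_constraints       # running mean return
        self.reward_range = reward_range
        self.delta = delta
        self.total = 0

    def _bonus(self, i):
        # UCB1 exploration bonus used for selection (grows slowly with time).
        if self.counts[i] == 0:
            return math.inf
        return self.reward_range * math.sqrt(
            2.0 * math.log(self.total) / self.counts[i])

    def _radius(self, i):
        # Fixed-confidence Hoeffding radius used for elimination.
        if self.counts[i] == 0:
            return math.inf
        return self.reward_range * math.sqrt(
            math.log(2.0 / self.delta) / (2.0 * self.counts[i]))

    def select(self):
        # Pick the active constraint with the highest optimistic estimate.
        return max(sorted(self.active),
                   key=lambda i: self.means[i] + self._bonus(i))

    def update(self, i, episode_return):
        self.total += 1
        self.counts[i] += 1
        self.means[i] += (episode_return - self.means[i]) / self.counts[i]
        # Drop any constraint whose upper bound falls below the best lower bound.
        best_lower = max(self.means[j] - self._radius(j) for j in self.active)
        self.active = {j for j in self.active
                       if self.means[j] + self._radius(j) >= best_lower}


def simulate(rounds=2000, seed=0):
    # Toy stand-in for "run the base learner under constraint i for an episode":
    # each constraint yields a noisy return around a made-up mean.
    rng = random.Random(seed)
    true_means = [0.2, 0.5, 0.8]
    selector = ConstraintSelector(len(true_means))
    for _ in range(rounds):
        i = selector.select()
        selector.update(i, true_means[i] + rng.uniform(-0.05, 0.05))
    return selector
```

In this toy setup, calling `simulate()` and inspecting `selector.active` and `selector.counts` shows which candidates survive and how often each was tried. Using a wider bonus for selection than for elimination is a design choice of this sketch: if both used the same bound, optimistic selection would keep every candidate's upper bound roughly level and elimination would rarely trigger.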

