LG AI ML Feb 12, 2020

Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing

Ge Liu, Rui Wu, Heng-Tze Cheng, Jing Wang, Jayden Ooi, Lihong Li, Ang Li, Wai Lok Sibon Li, Craig Boutilier, Ed Chi

arXiv:2002.05229v12.34 citations

Originality: Incremental advance

AI Analysis

This work addresses data inefficiency in real-world RL applications such as healthcare and recommender systems, but the advance is incremental because it builds on existing hyper-parameter tuning methods.

The authors tackle the data inefficiency of deep reinforcement learning with Adaptive Behavior Policy Sharing (ABPS), which shares experience collected by adaptively selected policies to reduce the cost of hyper-parameter tuning. On Atari games, ABPS achieves superior overall performance and reduced variance with the same interaction budget as training a single agent.

Deep Reinforcement Learning (RL) has proven powerful for decision making in simulated environments. However, training deep RL models is challenging in real-world applications such as production-scale healthcare or recommender systems, because interaction is expensive and the budget at deployment is limited. One source of this data inefficiency is the expensive hyper-parameter tuning required when optimizing deep neural networks. We propose Adaptive Behavior Policy Sharing (ABPS), a data-efficient training algorithm that allows sharing of experience collected by a behavior policy that is adaptively selected from a pool of agents trained with an ensemble of hyper-parameters. We further extend ABPS to evolve hyper-parameters during training by hybridizing ABPS with an adapted version of Population Based Training (ABPS-PBT). We conduct experiments on multiple Atari games with up to 16 hyper-parameter/architecture setups. ABPS achieves superior overall performance, reduced variance on the top 25% of agents, and equivalent performance on the best agent compared to conventional hyper-parameter tuning with independent training, even though ABPS requires only the same number of environment interactions as training a single agent. We also show that ABPS-PBT further improves the convergence speed and reduces the variance.
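The experience-sharing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy chain environment, the tabular Q-learning agents, the epsilon-greedy selection over recent episode returns, and every name below (`ChainEnv`, `QAgent`, `run_abps_sketch`, `select_eps`) are assumptions made for illustration. The abstract only states that one adaptively selected agent from the pool acts as the behavior policy and that its experience is shared; the concrete selection rule here is a stand-in.

```python
import random
from collections import deque


class ChainEnv:
    """Toy chain (assumption): action 1 moves right, action 0 stays; reward 1 at the last state."""

    def __init__(self, n=5):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        if a == 1:
            self.s += 1
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done


class QAgent:
    """Tabular Q-learner; (lr, eps) stands in for one hyper-parameter setup in the pool."""

    def __init__(self, n_states, lr, eps, gamma=0.9):
        self.q = [[0.0, 0.0] for _ in range(n_states)]
        self.lr, self.eps, self.gamma = lr, eps, gamma

    def act(self, s, rng):
        if rng.random() < self.eps:
            return rng.randrange(2)
        best = max(self.q[s])
        # Break ties randomly so an untrained agent still explores.
        return rng.choice([a for a in (0, 1) if self.q[s][a] == best])

    def update(self, s, a, r, s2, done):
        target = r + (0.0 if done else self.gamma * max(self.q[s2]))
        self.q[s][a] += self.lr * (target - self.q[s][a])


def run_abps_sketch(num_episodes=200, max_steps=20, batch=32, select_eps=0.2, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    configs = [(0.05, 0.3), (0.1, 0.1), (0.5, 0.2), (0.9, 0.05)]  # hypothetical setups
    agents = [QAgent(env.n, lr, eps) for lr, eps in configs]
    recent = [deque(maxlen=10) for _ in agents]  # recent returns per agent
    buffer = deque(maxlen=5000)                  # replay buffer shared by all agents
    env_steps = 0
    for _ in range(num_episodes):
        # Adaptive selection (assumed rule): try each agent once, then epsilon-greedy
        # over the mean recent return each agent obtained as behavior policy.
        untried = [i for i, r in enumerate(recent) if not r]
        if untried:
            k = untried[0]
        elif rng.random() < select_eps:
            k = rng.randrange(len(agents))
        else:
            k = max(range(len(agents)), key=lambda i: sum(recent[i]) / len(recent[i]))
        # Only the selected agent interacts with the environment in this episode.
        s, ep_return = env.reset(), 0.0
        for _ in range(max_steps):
            a = agents[k].act(s, rng)
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, done))
            env_steps += 1
            ep_return += r
            s = s2
            if done:
                break
        recent[k].append(ep_return)
        # Every agent in the pool learns off-policy from the shared experience.
        for agent in agents:
            for transition in rng.sample(list(buffer), min(batch, len(buffer))):
                agent.update(*transition)
    return agents, env_steps, len(buffer)
```

In this sketch the number of environment steps is bounded by `num_episodes * max_steps` regardless of pool size, which mirrors the claim that the whole pool consumes the interaction budget of a single agent. Sharing is only possible because the learners are off-policy: each agent can update from transitions generated by a different agent's policy.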

