AI, Apr 24, 2016

Deep Learning for Reward Design to Improve Monte Carlo Tree Search in ATARI Games

arXiv:1604.07095v1, 56 citations
Originality: Incremental advance
AI Analysis

This work addresses the weak performance of MCTS on sequential decision-making problems in video games. It is an incremental advance: it combines an existing reward-design method (PGRD) with deep learning so that reward adaptation becomes more practically applicable.

The authors address poor Monte Carlo Tree Search (MCTS) performance in ATARI games, which arises when the planning depth and the number of sampled trajectories are limited or when rewards are sparse. They adapt PGRD to use a deep convolutional neural network that learns a reward-bonus function, which improves UCT's performance on multiple games.

Monte Carlo Tree Search (MCTS) methods have proven powerful in planning for sequential decision-making problems such as Go and video games, but their performance can be poor when the planning depth and sampling trajectories are limited or when the rewards are sparse. We present an adaptation of PGRD (policy-gradient for reward-design) for learning a reward-bonus function to improve UCT (an MCTS algorithm). Unlike previous applications of PGRD in which the space of reward-bonus functions was limited to linear functions of hand-coded state-action-features, we use PGRD with a multi-layer convolutional neural network to automatically learn features from raw perception as well as to adapt the non-linear reward-bonus function parameters. We also adopt a variance-reducing gradient method to improve PGRD's performance. The new method improves UCT's performance on multiple ATARI games compared to UCT without the reward bonus. Combining PGRD and Deep Learning in this way should make adapting rewards for MCTS algorithms far more widely and practically applicable than before.
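
To make the role of the reward bonus concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard UCB1 selection rule for UCT and assumes the bonus is simply added to the game reward at each step of a sampled trajectory inside the planner. The function names (`ucb1_score`, `discounted_return`, `select_action`), the exploration constant `c`, and the discount `gamma` are all hypothetical, and the `bonuses` list stands in for the output of the convolutional network described in the abstract.

```python
import math


def ucb1_score(q_value, parent_visits, child_visits, c=1.0):
    """UCB1 score for one action at a tree node (assumed selection rule)."""
    if child_visits == 0:
        return math.inf  # untried actions are selected first
    return q_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def discounted_return(rewards, bonuses, gamma=0.99):
    """Return of one sampled trajectory, using reward + bonus at every step.

    `bonuses` is a placeholder for the learned reward-bonus values; in this
    sketch it is just a list of floats supplied by the caller.
    """
    total = 0.0
    for t, (reward, bonus) in enumerate(zip(rewards, bonuses)):
        total += (gamma ** t) * (reward + bonus)
    return total


def select_action(q_values, visit_counts, c=1.0):
    """Pick the index of the action with the highest UCB1 score."""
    parent_visits = sum(visit_counts)
    scores = [
        ucb1_score(q, parent_visits, n, c)
        for q, n in zip(q_values, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With sparse rewards, two short trajectories can both have zero game reward, so the planner cannot tell them apart. Under the assumptions above, a nonzero bonus gives the planner a signal to prefer one of them, while the agent is still judged on the game score alone.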

