LG · ML · Oct 7, 2019

Multi-step Greedy Reinforcement Learning Algorithms

arXiv:1910.02919v3 · 12 citations
Originality: Incremental advance
AI Analysis

This work aims to improve the performance of model-free reinforcement learning in applications such as gaming and robotics. The advance is incremental, since it builds on existing RL methods.

The paper tackles the problem of improving model-free reinforcement learning performance by introducing multi-step greedy algorithms, κ-Policy Iteration and κ-Value Iteration, which use surrogate decision problems with shaped rewards and reduced discount factors. Results show that for appropriate κ values, these algorithms outperform DQN and TRPO on Atari and MuJoCo benchmarks, indicating significant performance gains.

Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: $κ$-Policy Iteration ($κ$-PI) and $κ$-Value Iteration ($κ$-VI). These methods iteratively compute the next policy ($κ$-PI) and value function ($κ$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $κ$-PI and $κ$-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of $κ$, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.
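
To make the surrogate problem concrete, here is a minimal tabular sketch of $κ$-PI in Python. It is not the paper's model-free algorithm: it assumes a small, fully known MDP and solves the surrogate exactly by value iteration, where the paper would use DQN or TRPO. The abstract does not spell out the surrogate, so the form used here (shaped reward $r(s,a) + γ(1-κ)\,\mathbb{E}[v(s')]$ and discount $γκ$) is an assumption based on the standard $κ$-PI formulation in the literature. All function names (`random_mdp`, `solve_mdp`, `evaluate_policy`, `kappa_pi`) are hypothetical.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    """Build a random MDP. P has shape (S, A, S); R has shape (S, A)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
    R = rng.random((n_states, n_actions))
    return P, R

def solve_mdp(P, R, gamma, iters=1000, tol=1e-10):
    """Plain value iteration; returns a greedy policy and its value estimate."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        v_new = (R + gamma * (P @ v)).max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    q = R + gamma * (P @ v)
    return q.argmax(axis=1), v

def evaluate_policy(P, R, gamma, pi):
    """Exact policy evaluation in the original MDP via a linear solve."""
    idx = np.arange(P.shape[0])
    P_pi, R_pi = P[idx, pi], R[idx, pi]
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, R_pi)

def kappa_pi(P, R, gamma, kappa, n_iters=50):
    """Tabular kappa-PI sketch (assumed surrogate form, see lead-in)."""
    v = np.zeros(P.shape[0])
    pi = np.zeros(P.shape[0], dtype=int)
    for _ in range(n_iters):
        # Shaped reward: original reward plus a (1 - kappa) share of the
        # discounted value of the next state under the current estimate v.
        shaped_R = R + gamma * (1.0 - kappa) * (P @ v)
        # Surrogate problem: same dynamics, smaller discount gamma * kappa.
        pi, _ = solve_mdp(P, shaped_R, gamma * kappa)
        # Evaluate the resulting policy in the original MDP.
        v = evaluate_policy(P, R, gamma, pi)
    return pi, v
```

Under this assumed form, $κ = 0$ reduces to ordinary one-step greedy policy iteration, and $κ = 1$ makes the surrogate identical to the original problem. Calling `kappa_pi(P, R, 0.9, 0.5)` on the output of `random_mdp(5, 3)` runs the intermediate case.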
