LG AI ML Jun 10, 2018

Distributional Advantage Actor-Critic

arXiv:1806.06914v1 5.78 citations

Originality: Incremental advance

AI Analysis

This work addresses reinforcement learning stability and performance for AI agents, but it is incremental as it builds on existing actor-critic and distributional methods.

The paper aims to improve reinforcement learning by replacing value functions with value distributions. It develops the Distributional Advantage Actor-Critic (DA2C) algorithm, which performed at least as well as the baselines and outperformed them on some tasks, with smaller variance and increased stability.

In traditional reinforcement learning, an agent maximizes the reward collected during its interaction with the environment by approximating the optimal policy through the estimation of value functions. Typically, given a state s and an action a, the corresponding value is the expected discounted sum of rewards. The optimal action is then chosen to be the action a with the largest value estimated by the value function. However, recent developments have provided both theoretical and experimental evidence of superior performance when the value function is replaced with a value distribution in the context of deep Q-learning [1]. In this paper, we develop a new algorithm that combines advantage actor-critic with a value distribution estimated by quantile regression. We evaluated this new algorithm, termed Distributional Advantage Actor-Critic (DA2C or QR-A2C), on a variety of tasks and observed that it performs at least as well as the baseline algorithms, and outperforms them on some tasks with smaller variance and increased stability.
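To make the combination described in the abstract more concrete, here is a minimal sketch of a quantile-regression critic feeding an advantage estimate. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the quantile levels are taken as evenly spaced midpoints (a common choice in quantile-regression critics), the loss is the plain pinball loss rather than any smoothed variant, and the advantage is a one-step estimate that uses the mean of the quantiles as the state value.

```python
from typing import List


def quantile_midpoints(n: int) -> List[float]:
    """Assumed quantile levels tau_i = (2i + 1) / (2n) for i = 0..n-1."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def quantile_regression_loss(quantiles: List[float], targets: List[float]) -> float:
    """Pinball (quantile regression) loss of predicted quantiles against target samples.

    For each predicted quantile theta at level tau and each target sample z,
    the penalty is u * (tau - 1[u < 0]) with u = z - theta, which is always >= 0.
    The result is summed over quantiles and averaged over target samples.
    """
    taus = quantile_midpoints(len(quantiles))
    total = 0.0
    for tau, theta in zip(taus, quantiles):
        for z in targets:
            u = z - theta
            total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / len(targets)


def state_value(quantiles: List[float]) -> float:
    """Scalar value estimate: the mean of the predicted quantiles (an assumption)."""
    return sum(quantiles) / len(quantiles)


def one_step_advantage(
    reward: float,
    gamma: float,
    next_quantiles: List[float],
    quantiles: List[float],
) -> float:
    """One-step advantage r + gamma * V(s') - V(s), with V taken from the quantiles."""
    return reward + gamma * state_value(next_quantiles) - state_value(quantiles)
```

In an actor-critic loop of this shape, the critic would be trained by minimizing the quantile regression loss against distributional targets, while the actor's policy-gradient step would be weighted by the scalar advantage. How the paper constructs its targets and weights the two losses is not stated in the abstract, so those details are left out here.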
