AI · LG · Jun 13, 2021

Characterizing the Gap Between Actor-Critic and Policy Gradient

arXiv:2106.06932v1 · 18 citations
Originality: Incremental advance
AI Analysis

This work addresses a theoretical gap in reinforcement learning: the precise relationship between actor-critic and policy gradient methods. The advance is incremental because it builds on existing actor-critic methods.

The paper characterizes the gap between actor-critic and policy gradient methods by identifying an exact adjustment to recover the true policy gradient, leading to practical algorithms that improve sample efficiency and final performance in experiments on tabular and continuous environments.

Actor-critic (AC) methods are ubiquitous in reinforcement learning. Although it is understood that AC methods are closely related to policy gradient (PG), their precise connection has not been fully characterized previously. In this paper, we explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient that recovers the true policy gradient of the cumulative reward objective (PG). Furthermore, by viewing the AC method as a two-player Stackelberg game between the actor and critic, we show that the Stackelberg policy gradient can be recovered as a special case of our more general analysis. Based on these results, we develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG and use these to modify the standard AC algorithm. Experiments on popular tabular and continuous environments show the proposed corrections can improve both the sample efficiency and final performance of existing AC methods.
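
To make the gap described above concrete, here is a minimal sketch on a small random tabular MDP. It is not the paper's Residual Actor-Critic or Stackelberg Actor-Critic algorithm. It relies on a standard identity, used here as an assumption about how such a correction can be formed: the critic's error, Q^π minus the critic's estimate, is itself the Q-function of the same policy when the reward is replaced by the critic's Bellman residual. The MDP, the softmax parameterization, the noisy critic `q_hat`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(theta):
    """Row-wise softmax policy pi(a|s) from logits theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def q_values(P, reward, pi, gamma):
    """Exact Q-function of pi for the given reward, via a linear solve."""
    S, A = reward.shape
    # Transition matrix over state-action pairs: (s, a) -> (s', a').
    P_sa = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_sa, reward.reshape(S * A))
    return q.reshape(S, A)


def visitation(P, pi, mu, gamma):
    """Unnormalized discounted state visitation d(s) = sum_t gamma^t Pr(s_t = s)."""
    P_s = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve((np.eye(len(mu)) - gamma * P_s).T, mu)


def softmax_gradient(d, pi, q):
    """Gradient w.r.t. softmax logits when q is plugged into the PG formula."""
    v = (pi * q).sum(axis=1, keepdims=True)
    return d[:, None] * pi * (q - v)


# Hypothetical random MDP (illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
reward = rng.random((S, A))
mu = np.full(S, 1.0 / S)
theta = rng.normal(size=(S, A))
pi = softmax(theta)

q_true = q_values(P, reward, pi, gamma)
d = visitation(P, pi, mu, gamma)

# An imperfect critic: the true Q-function plus noise (an assumption of this sketch).
q_hat = q_true + 0.5 * rng.normal(size=(S, A))

g_pg = softmax_gradient(d, pi, q_true)  # true policy gradient
g_ac = softmax_gradient(d, pi, q_hat)   # actor-critic gradient using the critic

# Bellman residual of the critic under pi, treated as a reward signal.
v_hat = (pi * q_hat).sum(axis=1)
delta = reward + gamma * (P @ v_hat) - q_hat
q_res = q_values(P, delta, pi, gamma)   # equals q_true - q_hat
g_res = softmax_gradient(d, pi, q_res)  # correction term

print("max |PG - AC|            :", np.abs(g_pg - g_ac).max())
print("max |PG - (AC + residual)|:", np.abs(g_pg - (g_ac + g_res)).max())
```

In this sketch the actor-critic gradient differs from the true policy gradient whenever the critic is inexact, and adding the gradient computed from the residual Q-function closes that difference up to numerical precision. In practice the residual term would have to be estimated from samples rather than solved exactly.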

