LG · Nov 16, 2021

Off-Policy Actor-Critic with Emphatic Weightings

Eric Graves, Ehsan Imani, Raksha Kumaraswamy, Martha White

arXiv:2111.08172v36.57 citations · Has Code

Originality: Highly original

AI Analysis

This work addresses a foundational problem in reinforcement learning by providing a theoretically sound off-policy actor-critic method, although its improvement over existing approaches is incremental.

The paper tackles the lack of a clear policy gradient theorem in off-policy reinforcement learning. It unifies the existing objectives into a single off-policy objective and derives a policy gradient theorem for it using emphatic weightings, which yields the ACE algorithm. In a counterexample, ACE finds the optimal solution where semi-gradient methods such as OffPAC and DPG converge to the wrong one, and in the empirical tests ACE performs as well as or better than OffPAC on the control environments studied.

A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods--particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)--converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
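The abstract does not spell out the update rule, so the following is only a minimal sketch of how an emphatically weighted actor update is commonly written, not the authors' implementation. It assumes a follow-on trace F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t) and an emphasis M_t = (1 - eta) * i(S_t) + eta * F_t, where rho is the importance sampling ratio and i is the interest function. The function names, the trade-off parameter `eta`, and all numeric values are hypothetical.

```python
def emphatic_weight(followon_prev, rho_prev, interest, gamma, eta):
    """Return (followon, emphasis) for the current step.

    Assumed form (not taken from the abstract):
      followon = gamma * rho_prev * followon_prev + interest
      emphasis = (1 - eta) * interest + eta * followon
    """
    followon = gamma * rho_prev * followon_prev + interest
    emphasis = (1.0 - eta) * interest + eta * followon
    return followon, emphasis


def actor_step(theta, grad_log_pi, rho, emphasis, td_error, step_size):
    """One actor update: theta += step_size * rho * emphasis * td_error * grad."""
    scale = step_size * rho * emphasis * td_error
    return [w + scale * g for w, g in zip(theta, grad_log_pi)]


# Illustrative values only.
followon, emphasis_full = emphatic_weight(
    followon_prev=2.0, rho_prev=0.5, interest=1.0, gamma=0.9, eta=1.0
)
_, emphasis_semi = emphatic_weight(
    followon_prev=2.0, rho_prev=0.5, interest=1.0, gamma=0.9, eta=0.0
)
new_theta = actor_step(
    theta=[0.0, 1.0], grad_log_pi=[1.0, -2.0],
    rho=2.0, emphasis=emphasis_full, td_error=0.5, step_size=0.1,
)
```

Under these assumptions, setting `eta` to 0 with unit interest makes the emphasis equal to 1, which reduces the step to a semi-gradient update of the kind the abstract attributes to OffPAC. Larger values of `eta` weight the update by the accumulated trace, at the likely cost of higher variance.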
