LG | AI | Sep 11, 2025

Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning

arXiv:2509.09838v1 | 1 citation | h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck for RL researchers and practitioners by improving off-policy actor-critic methods for discrete-action environments such as Atari. The advance is incremental because it builds directly on DSAC.

The paper tackles the poor performance of actor-critic methods in discrete-action off-policy reinforcement learning. It shows that decoupling the actor and critic entropy in DSAC yields performance comparable to DQN on Atari games, and it proves convergence to the optimal regularized value function in the tabular setting.

Value-based approaches such as DQN are the default methods for off-policy reinforcement learning in discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not learn effectively from off-policy data (e.g., PPO), or perform poorly in the discrete-action setting (e.g., SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that merely decoupling these components gives DSAC performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
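
To make the idea of decoupling the actor and critic entropy concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it assumes the standard discrete SAC forms of the critic target and actor loss, and it assumes that "decoupling" means giving each its own entropy coefficient. The names `critic_target`, `actor_loss`, `alpha_critic`, and `alpha_actor` are hypothetical, chosen for illustration only.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def critic_target(reward, done, next_q, next_logits, gamma=0.99, alpha_critic=0.0):
    """One-step critic target for discrete actions (assumed DSAC-style form).

    y = r + gamma * (1 - done) * sum_a pi(a|s') * (Q(s', a) - alpha_critic * log pi(a|s'))

    With alpha_critic = 0 the entropy bonus disappears from the critic and the
    target reduces to an expected-SARSA-style backup under the current policy.
    """
    log_pi = log_softmax(next_logits)
    pi = np.exp(log_pi)
    soft_value = (pi * (next_q - alpha_critic * log_pi)).sum(axis=-1)
    return reward + gamma * (1.0 - done) * soft_value


def actor_loss(q, logits, alpha_actor=0.01):
    """Actor loss for discrete actions (assumed DSAC-style form).

    L = mean_s sum_a pi(a|s) * (alpha_actor * log pi(a|s) - Q(s, a))

    alpha_actor is independent of alpha_critic; that independence is the
    "decoupling" this sketch assumes.
    """
    log_pi = log_softmax(logits)
    pi = np.exp(log_pi)
    return (pi * (alpha_actor * log_pi - q)).sum(axis=-1).mean()


# Tiny usage example: one transition, two actions, uniform policy.
reward = np.array([1.0])
done = np.array([0.0])
q_values = np.array([[1.0, 3.0]])
logits = np.zeros((1, 2))

y_plain = critic_target(reward, done, q_values, logits, gamma=0.5, alpha_critic=0.0)
y_soft = critic_target(reward, done, q_values, logits, gamma=0.5, alpha_critic=1.0)
loss = actor_loss(q_values, logits, alpha_actor=1.0)
```

Setting `alpha_critic` and `alpha_actor` to the same value recovers the coupled form in this sketch, while setting both to zero corresponds to training without entropy regularization, a case the abstract reports as still approaching DQN's performance.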

