LG Feb 6, 2025

Mirror Descent Actor Critic via Bounded Advantage Learning

arXiv:2502.03854v2
Originality: Incremental advance
AI Analysis

This work addresses a performance gap in continuous control RL for researchers and practitioners, though it is incremental as it builds on existing regularization frameworks.

The paper tackles the underperformance of KL-entropy-regularized reinforcement learning methods in continuous action domains by proposing Mirror Descent Actor Critic (MDAC), which bounds the actor's log-density terms in the critic's loss, leading to significantly better empirical performance than non-bounded and entropy-only-regularized methods.

Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite their empirical success in discrete action domains and strong theoretical guarantees, KL-entropy-regularized methods do not outperform a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor's log-density terms in the critic's loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor's log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is valid and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.
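The link between the actor's log-probability and the regularized advantage can be checked numerically in the simplest tabular, entropy-only case. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a softmax policy over a random Q-table, and the `bounded` function (a scaled tanh) together with its `scale` parameter is a hypothetical stand-in for a bounding function, since the abstract does not specify which functions MDAC uses.

```python
import numpy as np


def soft_value(q, tau):
    """Soft state value V(s) = tau * log sum_a exp(Q(s, a) / tau)."""
    z = q / tau
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return tau * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))


def log_softmax_policy(q, tau):
    """log pi(a|s) for the tabular policy pi(a|s) proportional to exp(Q(s, a) / tau)."""
    z = q / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def bounded(x, scale=1.0):
    """Hypothetical bounding function: squashes x into [-scale, scale]."""
    return scale * np.tanh(x / scale)


rng = np.random.default_rng(0)
q = 5.0 * rng.normal(size=(4, 3))  # assumed toy Q-table: 4 states, 3 actions
tau = 0.5                          # assumed entropy temperature

log_pi = log_softmax_policy(q, tau)
advantage = q - soft_value(q, tau)[..., None]  # regularized advantage Q - V

# In the tabular case, tau * log pi(a|s) coincides with Q(s, a) - V(s).
identity_holds = np.allclose(tau * log_pi, advantage)

# The raw log-probability term is unbounded below; the bounded version is not.
raw_term = tau * log_pi
bounded_term = bounded(raw_term)

print("identity holds:", identity_holds)
print("most negative raw term:", raw_term.min())
print("most negative bounded term:", bounded_term.min())
```

The raw term grows without limit in magnitude as an action becomes unlikely, while the squashed term stays within a fixed range, which illustrates why a critic loss built on the latter sees better-conditioned targets.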

