OCLGSYJun 6, 2024

Optimal Rates of Convergence for Entropy Regularization in Discounted Markov Decision Processes

arXiv:2406.04163v51 citations
Originality Highly original
AI Analysis

This provides theoretical guarantees for entropy regularization in reinforcement learning, improving convergence rates for policy optimization methods.

The paper tackles the error introduced by entropy regularization in discounted Markov decision processes, showing that it decreases exponentially with the inverse regularization strength, improving over previous linear estimates, and uses this to prove exponential convergence for natural policy gradient methods.

We study the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted KL-divergence and in value with a problem-specific exponent. This is in contrast to previously known estimates, of the order $O(τ)$, where $τ$ is the regularization strength. We provide a lower bound that matches our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of this gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of this gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias regarding general convex potentials and their resulting generalized natural policy gradients.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes