ML AI DS LG NA, Feb 9, 2024

Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit

arXiv:2402.06388v3, 3 citations, h-index: 19, ICPR
AI Analysis

This work addresses a theoretical gap in reinforcement learning by analyzing policy gradient methods for multi-armed bandits, though it appears incremental as it focuses on adding regularization to an existing framework.

The authors investigated the convergence of a policy gradient algorithm with L2 regularization for multi-armed bandit problems, proving theoretical convergence under specific conditions and demonstrating through numerical tests that a time-dependent regularized approach can outperform the canonical method, particularly when initial guesses are poor.

Although the Multi-Armed Bandit (MAB) on the one hand and the policy gradient approach on the other are among the most widely used frameworks in Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to the MAB have not received enough attention. In this work we investigate the convergence of this procedure when an $L2$ regularization term is present together with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve over the canonical approach, especially when the initial guess is far from the solution.
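To make the setting in the abstract concrete, here is a minimal sketch of a softmax policy gradient update for a bandit with an L2 penalty on the preference vector. It is not taken from the paper or its repository: the function name `regularized_policy_gradient`, the Gaussian reward model, the penalty form `(gamma_t / 2) * ||h||^2`, and the decay schedule `gamma0 / (1 + decay * t)` are all assumptions made for illustration.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax over a 1-D preference vector."""
    z = h - np.max(h)
    e = np.exp(z)
    return e / e.sum()


def regularized_policy_gradient(means, h0, steps=5000, lr=0.1,
                                gamma0=0.1, decay=0.0, seed=0):
    """Stochastic gradient ascent on expected reward minus an L2 penalty.

    Assumptions (not from the source): unit-variance Gaussian rewards,
    penalty (gamma_t / 2) * ||h||^2, and gamma_t = gamma0 / (1 + decay * t).
    decay = 0 gives a constant regularization coefficient.
    """
    rng = np.random.default_rng(seed)
    h = np.asarray(h0, dtype=float).copy()
    k = len(means)
    for t in range(1, steps + 1):
        pi = softmax(h)
        a = rng.choice(k, p=pi)           # sample an arm from the policy
        r = rng.normal(means[a], 1.0)     # noisy reward of the chosen arm
        # Score-function estimate of d E[r] / d h_i: r * (1{i == a} - pi_i)
        grad = -r * pi
        grad[a] += r
        gamma_t = gamma0 / (1.0 + decay * t)
        # Ascent step; the L2 penalty contributes -gamma_t * h
        h += lr * (grad - gamma_t * h)
    return h, softmax(h)


# Usage: start from preferences that favour the worst arm (a poor initial guess).
h, pi = regularized_policy_gradient([0.2, 0.5, 0.8], [5.0, 0.0, 0.0],
                                    steps=2000, decay=0.01)
```

Setting `gamma0=0` recovers the unregularized ("canonical") update, which makes it easy to compare the two variants from the same initial guess under these assumptions.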

Code Implementations: 1 repo