MLAILGSTMar 10

A Diffusion Analysis of Policy Gradient for Stochastic Bandits

arXiv:2603.10219v127.71 citationsh-index: 38
Predicted impact top 4% in ML · last 90 daysOriginality Incremental advance
AI Analysis

This provides theoretical insights into the performance of policy gradient in bandit settings, which is incremental as it builds on existing methods with new analysis.

The paper tackles the problem of analyzing policy gradient methods for stochastic bandits by using a continuous-time diffusion approximation, proving that with a specific learning rate, the regret scales as O(k log(k) log(n)/η), and constructing an instance showing linear regret unless the learning rate is sufficiently small.

We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $η= O(Δ^2/\log(n))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $η= O(Δ^2)$.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes