A Diffusion Analysis of Policy Gradient for Stochastic Bandits
This provides theoretical insights into the performance of policy gradient in bandit settings, which is incremental as it builds on existing methods with new analysis.
The paper tackles the problem of analyzing policy gradient methods for stochastic bandits by using a continuous-time diffusion approximation, proving that with a specific learning rate, the regret scales as O(k log(k) log(n)/η), and constructing an instance showing linear regret unless the learning rate is sufficiently small.
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $η= O(Δ^2/\log(n))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $η= O(Δ^2)$.