Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits
This work provides theoretical guarantees for a policy gradient variant in bandit problems, which is incremental as it modifies learning rate schedules rather than introducing a new paradigm.
The paper tackles the regret analysis of a policy gradient algorithm for multi-arm bandits with Bernoulli rewards by allowing state-dependent learning rates, proving that with well-chosen rates, the algorithm achieves logarithmic or poly-logarithmic regret as the state converges to the optimal arm.
We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.