Thompson Sampling for Linear Bandit Problems with Normal-Gamma Priors
This work addresses bandit optimization for scenarios with uncertain variance, but it is incremental as it extends prior methods with specific priors.
The paper tackles the linear bandit problem with unknown variance by using Thompson sampling with normal-gamma priors, deriving a Bayesian regret bound under a moment condition.
We consider Thompson sampling for linear bandit problems with finitely many independent arms, where rewards are sampled from normal distributions that are linearly dependent on unknown parameter vectors and with unknown variance. Specifically, with a Bayesian formulation we consider multivariate normal-gamma priors to represent environment uncertainty for all involved parameters. We show that our chosen sampling prior is a conjugate prior to the reward model and derive a Bayesian regret bound for Thompson sampling under the condition that the 5/2-moment of the variance distribution exist.