Efficient and Adaptive Posterior Sampling Algorithms for Bandits
This work addresses the scalability and computational cost of bandit algorithms in large-scale real-world applications. It is an incremental improvement that contributes a tighter regret bound and two new parameterized methods.
The paper addresses the fact that the existing regret bound for Thompson Sampling in stochastic bandits is vacuous whenever T ≤ 288e^64. It derives a tighter bound that reduces the leading coefficient from 288e^64 to 1270, and proposes two scalable algorithms, TS-MA-α and TS-TD-α, that achieve O(K ln^{α+1}(T)/Δ) regret, where α balances utility against computation.
We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance between utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single-round performance loss when pulling a sub-optimal arm.
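For orientation, the baseline that the tightened bound refers to can be sketched as follows. This is a minimal illustration of Thompson Sampling with Gaussian priors in its commonly stated form, where arm $i$ with $n_i$ pulls and reward sum $S_i$ is sampled from $\mathcal{N}\left(S_i/(n_i+1),\, 1/(n_i+1)\right)$. The function name, the Bernoulli reward model, and the parameters `means`, `horizon`, and `seed` are assumptions made for the example; the sketch does not implement TS-MA-$α$ or TS-TD-$α$, whose details are not given in this abstract.

```python
import numpy as np


def gaussian_thompson_sampling(means, horizon, seed=0):
    """Simulate Gaussian-prior Thompson Sampling on Bernoulli arms.

    Returns the cumulative pseudo-regret and the per-arm pull counts.
    `means` holds the (hypothetical) true arm means, each in [0, 1].
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    num_arms = len(means)
    pulls = np.zeros(num_arms)        # n_i: number of pulls of arm i
    reward_sums = np.zeros(num_arms)  # S_i: total reward collected from arm i
    best_mean = means.max()
    regret = 0.0

    for _ in range(horizon):
        # Posterior of arm i: N(S_i / (n_i + 1), 1 / (n_i + 1)).
        posterior_mean = reward_sums / (pulls + 1.0)
        posterior_std = 1.0 / np.sqrt(pulls + 1.0)
        samples = rng.normal(posterior_mean, posterior_std)

        # Pull the arm whose posterior sample is largest.
        arm = int(np.argmax(samples))
        reward = float(rng.random() < means[arm])  # bounded reward in {0, 1}

        pulls[arm] += 1
        reward_sums[arm] += reward
        regret += best_mean - means[arm]  # single-round loss (the gap Δ)

    return regret, pulls
```

Calling `gaussian_thompson_sampling([0.2, 0.8], 2000)` simulates a two-armed instance with gap $Δ = 0.6$; each round draws one posterior sample per arm, which is the per-round computation that the parameter $α$ is meant to trade against regret.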