LG STSep 23, 2025

Asymptotically Optimal Problem-Dependent Bandit Policies for Transfer Learning

Adrien Prevost, Timothee Mathieu, Odalric-Ambrym Maillard

arXiv:2509.19098v14.1h-index: 27

Originality Incremental advance

AI Analysis

This addresses the problem of improving bandit algorithm efficiency using prior knowledge for researchers in reinforcement learning and decision-making, though it appears incremental as it extends existing KL-UCB methods to a transfer context.

The paper tackles the multi-armed bandit problem in a transfer learning setting where source samples inform target distributions, deriving an asymptotic lower bound on regret that incorporates transfer parameters and proposing KL-UCB-Transfer, which matches this bound for Gaussian distributions and outperforms baselines when source and target are close.

We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given N'_k i.i.d. samples from each source distribution nu'_k, and the true target distributions nu_k lie within a known distance bound d_k(nu_k, nu'_k) <= L_k. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters (d_k, L_k, N'_k). We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.

View on arXiv PDF

Similar