Offline-to-Online Learning in Linear Bandits
For practitioners in sequential decision-making, this work provides a principled method to leverage offline data in linear bandits, addressing a common practical bottleneck.
The paper proposes a linear bandit algorithm that effectively balances offline and online data, achieving sublinear regret in online interactions while benefiting from offline samples. Empirical results confirm its effectiveness across various settings.
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.