On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

arXiv:2605.0214187.21 citations
AI Analysis

For researchers in offline decision-making, this work provides tight sample complexity bounds for KL-regularized offline bandits, resolving an open question.

This paper characterizes the sample complexity of offline multi-armed bandits with KL regularization, providing matching upper and lower bounds that scale as Õ(ηSAC^{π^*}/ε) for large regularization and Ω̃(SAC^{π^*}/ε^2) for small regularization, achieving a nearly complete characterization.

Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(ηSAC^{π^*}/ε)$ under large regularization $η= \tilde{O}(ε^{-1})$, and a sample complexity of $\tildeΩ(SAC^{π^*}/ε^2)$ under small regularization $η= \tildeΩ(ε^{-1})$, where $η$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{π^*}$ policy coverage coefficient at the optimal policy $π^*$, $ε$ is the desired sub-optimality, and $\tilde{O}$ and $\tildeΩ$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes