Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
This work provides tighter theoretical bounds for reinforcement learning algorithms, advancing the understanding of regularization in offline settings, though it is incremental in refining existing analyses.
The paper tackles the sample complexity of offline policy learning with f-divergence regularization in contextual bandits, achieving an improved Õ(ε⁻¹) bound under single-policy concentrability for reverse KL divergence and showing that sharp Õ(ε⁻¹) complexity is achievable for strongly convex f-divergences without such concentrability.
Although many popular reinforcement learning algorithms are underpinned by $f$-divergence regularization, their sample complexity with respect to the \emph{regularized objective} still lacks a tight characterization. In this paper, we analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we give the first $\tilde{O}(ε^{-1})$ sample complexity under single-policy concentrability for contextual bandits, surpassing existing $\tilde{O}(ε^{-1})$ bound under all-policy concentrability and $\tilde{O}(ε^{-2})$ bound under single-policy concentrability. Our analysis for general function approximation leverages the principle of pessimism in the face of uncertainty to refine a mean-value-type argument to its extreme. This in turn leads to a novel moment-based technique, effectively bypassing the need for uniform control over the discrepancy between any two functions in the function class. We further propose a lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the strong convexity of reverse KL. In addition, for $f$-divergences with strongly convex $f$, to which reverse KL \emph{does not} belong, we show that the sharp sample complexity $\tildeΘ(ε^{-1})$ is achievable even without single-policy concentrability. In this case, the algorithm design can get rid of pessimistic estimators. We further extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.