LG | IT | ML | Feb 15, 2025

Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

arXiv:2502.10826v2 | 3 citations | h-index: 2 | COLT
Originality: Incremental advance
AI Analysis

This work addresses off-policy selection and learning in contextual bandits, offering incremental theoretical and empirical improvements over prior methods.

The paper tackles off-policy selection and learning in contextual bandits. It proposes a betting-based selection method with a variance-adaptive guarantee, and a learning objective called freezing that tends to induce low variance, which suits small-data regimes. Empirically, the selection method outperforms existing methods, and freezing improves performance in small-sample regimes.

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
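To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of an inverse propensity weight (IPW) sequence and a generic betting-style lower confidence bound on its mean. It is not the paper's algorithm: the abstract does not spell out the construction, so the sketch assumes rewards in [0, 1], known behavior-policy propensities, and a fixed bet size. The function names and the `bet` and `delta` parameters are hypothetical and chosen only for illustration.

```python
import math


def ipw_terms(target_probs, behavior_probs, rewards):
    """Per-sample IPW terms (pi(a_i|x_i) / pi_b(a_i|x_i)) * r_i.

    Assumes behavior_probs are strictly positive and rewards lie in [0, 1],
    so every term is nonnegative and its mean is the target policy's value.
    """
    return [p / b * r for p, b, r in zip(target_probs, behavior_probs, rewards)]


def ipw_estimate(terms):
    """Plain IPW point estimate: the sample mean of the IPW terms."""
    return sum(terms) / len(terms)


def betting_lcb(terms, delta=0.05, bet=0.5, tol=1e-6):
    """Fixed-bet lower confidence bound on the mean of nonnegative terms.

    Illustrative assumption: the true mean lies in [0, 1] and 0 < bet < 1.
    For a candidate mean m, the wealth prod_i (1 + bet * (z_i - m)) is
    nonnegative and has expectation at most 1 when m is at least the true
    mean, so Markov's inequality rejects m when wealth >= 1 / delta.
    """
    threshold = math.log(1.0 / delta)

    def log_wealth(m):
        return sum(math.log1p(bet * (z - m)) for z in terms)

    if log_wealth(0.0) < threshold:
        return 0.0  # not enough evidence to reject even m = 0
    if log_wealth(1.0) >= threshold:
        return 1.0  # every candidate in [0, 1] is rejected; cap at 1
    lo, hi = 0.0, 1.0  # lo is rejected, hi is not; wealth decreases in m
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_wealth(mid) >= threshold:
            lo = mid
        else:
            hi = mid
    return lo  # largest rejected candidate found, a conservative bound
```

For off-policy selection, one plausible use of such a bound is to compute it for each candidate policy and pick the policy with the largest lower bound. A fixed bet is the simplest choice; a variance-adaptive guarantee of the kind the abstract describes would require choosing the bet more carefully.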

