MLLGOct 24, 2022

PAC-Bayesian Offline Contextual Bandits With Guarantees

arXiv:2210.13132v223 citationsh-index: 27
Originality Highly original
AI Analysis

This provides a principled method with guarantees for offline contextual bandit problems, addressing a known bottleneck in reinforcement learning.

The paper tackles off-policy learning in contextual bandits by introducing a PAC-Bayesian approach that interprets policies as mixtures of decision rules, resulting in tighter generalization bounds and tractable algorithms that improve upon logging policies offline without needing hyperparameter tuning on held-out sets.

This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes