Cramming Contextual Bandits for On-policy Statistical Evaluation
This provides a more efficient statistical framework for evaluating contextual bandit policies, which is important for researchers and practitioners in reinforcement learning and online decision-making systems.
The authors tackled the problem of evaluating learned policies in contextual bandit algorithms using the same dataset generated by the algorithm, introducing the 'cram' method for on-policy evaluation. The results show that cram reduces evaluation standard error by approximately 40% compared to off-policy methods while maintaining unbiasedness and valid confidence intervals.
We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods that focus on off-policy performance evaluation of contextual bandit algorithms. Cramming utilizes an entire bandit sequence through a single pass of data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with the state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.