CAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning
This work addresses the need for reliable counterfactual estimation in applications such as recommender systems and personalized health care by offering a new estimator with improved theoretical and practical properties.
The paper tackles offline policy evaluation and learning in contextual bandits by introducing the Continuous Adaptive Blending (CAB) estimator. CAB can be substantially less biased than clipped IPS and the Direct Method, and it can have lower variance than the Doubly Robust and IPS estimators. Experiments show excellent evaluation accuracy and better learning performance than other counterfactual estimators.
The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how a new policy would have performed if it had been used instead of the logging policy. In this paper, we identify a family of counterfactual estimators that subsumes most such estimators proposed to date. Our analysis of this family identifies a new estimator, called Continuous Adaptive Blending (CAB), which enjoys many advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have less variance than the Doubly Robust and IPS estimators. In addition, it is sub-differentiable, so it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.
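The abstract does not state the estimator's formula, so the following is only a minimal sketch under an assumed form: CAB is taken to blend clipped IPS with the Direct Method, weighting the logged reward by min(M, pi/pi0) and the reward-model prediction for each action by max(pi - M*pi0, 0), where M >= 0 is a clipping constant. The function name `cab_estimate`, its arguments, and the toy numbers are hypothetical and chosen for illustration; they do not come from the paper.

```python
import numpy as np

def cab_estimate(pi, pi0, reward_model, actions, rewards, M):
    """Sketch of a CAB-style value estimate (assumed form, see lead-in).

    pi, pi0      : (n, k) target / logging policy probabilities per context
    reward_model : (n, k) predicted reward for every action (Direct Method)
    actions      : (n,) integer index of the logged action
    rewards      : (n,) observed reward for the logged action
    M            : clipping constant, M >= 0
    """
    pi = np.asarray(pi, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    reward_model = np.asarray(reward_model, dtype=float)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    rows = np.arange(pi.shape[0])

    # Clipped importance weight for the logged action: min(pi/pi0, M).
    weights = np.minimum(pi[rows, actions] / pi0[rows, actions], M)
    ips_part = weights * rewards

    # Reward-model part: pi * (1 - min(M*pi0/pi, 1)) == max(pi - M*pi0, 0),
    # written this way to avoid dividing by pi when pi is zero.
    dm_part = np.sum(np.maximum(pi - M * pi0, 0.0) * reward_model, axis=1)

    return float(np.mean(ips_part + dm_part))

# Hypothetical toy data: two contexts, two actions.
pi = [[0.5, 0.5], [0.9, 0.1]]
pi0 = [[0.5, 0.5], [0.5, 0.5]]
reward_model = [[1.0, 0.0], [0.0, 1.0]]
actions = [0, 1]
rewards = [1.0, 0.0]

value_dm_limit = cab_estimate(pi, pi0, reward_model, actions, rewards, M=0.0)
value_ips_limit = cab_estimate(pi, pi0, reward_model, actions, rewards, M=100.0)
```

Under this assumed form, M = 0 reduces the estimate to the Direct Method, and a sufficiently large M reduces it to unclipped IPS, so M moves the estimate continuously between the two. Both `np.minimum` and `np.maximum` are sub-differentiable in the policy probabilities, which is consistent with the abstract's remark that CAB can be used for learning.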