LG | AI | EM | AP | ML | Dec 4, 2022

Counterfactual Learning with General Data-generating Policies

arXiv:2212.01925v1 | 2 citations | h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses a key limitation in off-policy evaluation for practitioners using deterministic logging policies, enabling more accurate performance prediction and policy improvement in real-world applications like online marketing.

The paper tackles the problem of evaluating counterfactual policies in contextual-bandit settings when the log data comes from a deterministic or deficient-support logging policy, such as Upper Confidence Bound or decision-making based on supervised or unsupervised learning. It develops an off-policy evaluation method whose prediction converges in probability to the true performance as the sample size increases. It then applies the method to coupon targeting policies on an online platform and shows how the existing policy can be improved.

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandits (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
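
To make the support issue concrete, the sketch below uses the standard inverse-propensity-weighting (IPW) estimator, which is not the method proposed in the paper. It shows on a toy two-action problem how IPW behaves when the logging policy is fully randomized versus entirely deterministic. The environment, the reward values, and all function names (`logging_prob_action1`, `simulate_logs`, `ipw_value`, `always_one`) are assumptions made up for illustration.

```python
import random

def logging_prob_action1(x, explore):
    """Probability that the (hypothetical) logging policy picks action 1.

    explore=1.0 gives a uniform random policy (full support);
    explore=0.0 gives a deterministic threshold rule (deficient support).
    """
    greedy = 1.0 if x > 0.5 else 0.0
    return (1 - explore) * greedy + explore * 0.5

def simulate_logs(n, explore, seed=0):
    """Toy contextual bandit: context x ~ U(0, 1), two actions.

    Assumed rewards: action 1 pays x, action 0 pays 0.2.
    Each log row is (context, action, reward, logging_prob).
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.random()
        p1 = logging_prob_action1(x, explore)
        action = 1 if rng.random() < p1 else 0
        reward = x if action == 1 else 0.2
        prob = p1 if action == 1 else 1 - p1
        logs.append((x, action, reward, prob))
    return logs

def ipw_value(logs, target_prob):
    """Standard IPW estimate of the target policy's mean reward."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by target / logging probability.
        total += reward * target_prob(context, action) / logging_prob
    return total / len(logs)

def always_one(context, action):
    """Counterfactual policy: always choose action 1 (true value E[x] = 0.5)."""
    return 1.0 if action == 1 else 0.0

full_support = ipw_value(simulate_logs(20000, explore=1.0), always_one)
deterministic = ipw_value(simulate_logs(20000, explore=0.0), always_one)
print(full_support, deterministic)
```

In this toy setup the counterfactual policy's true value is 0.5. With uniform random logging the IPW estimate should land close to that value. With the deterministic threshold rule it instead concentrates near E[x · 1{x > 0.5}] = 0.375, because action 1 is never observed for contexts below the threshold. This gap is the kind of deficient-support failure that motivates the paper.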

