LGOCSTMLJun 10, 2020

Distributionally Robust Batch Contextual Bandits

arXiv:2006.05630v734 citations
Originality Highly original
AI Analysis

This work addresses the critical issue of policy deployment in shifting environments, such as in marketing or healthcare, offering a robust solution for real-world applications where data assumptions often fail.

The paper tackles the problem of learning robust policies from historical data when future environments may differ, by proposing a distributionally robust policy evaluation and learning algorithm that provides performance guarantees under worst-case shifts. Empirical tests on synthetic and real-world voting datasets demonstrate the algorithm's effectiveness in providing robustness missing in standard methods.

Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data -- an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under the worst-case environment shift. We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that is able to learn a policy that is robust to adversarial perturbations and unknown covariate shifts with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm in synthetic datasets and demonstrate that it provides the robustness that is missing using standard policy learning algorithms. We conclude the paper by providing a comprehensive application of our methods in the context of a real-world voting dataset.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes