LG | AI | IR | ME | ML | Sep 10, 2018

Efficient Counterfactual Learning from Bandit Feedback

arXiv:1809.03084v3 | 49 citations
AI Analysis

This paper addresses efficient counterfactual learning from logged bandit data, with applications such as advertisement optimization. The work appears incremental, as it builds on existing off-policy evaluation methods.

The paper tackles statistically efficient off-policy evaluation and optimization using batch data from bandit feedback. It shows that the proposed estimators achieve the lowest variance in a wide class of estimators and reduce variance relative to standard estimators. The authors then apply the estimators to advertisement design at a major advertisement company, where the estimators allow them to improve on the existing bandit algorithm with more statistical confidence than a state-of-the-art benchmark.

What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
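For readers unfamiliar with the setting, the sketch below illustrates what an offline estimator of a counterfactual policy's expected reward can look like. It uses plain inverse propensity weighting, a standard baseline, and is not the estimator proposed in the paper. The simulated log data, the function names (`ipw_value`, `simulate_logs`, `match_policy`), and the reward probabilities are all illustrative assumptions.

```python
import random


def ipw_value(logs, target_policy):
    """Inverse propensity weighting estimate of a target policy's expected reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to `action`.
    target_policy: function (context, action) -> probability of `action`.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def simulate_logs(n, seed=0):
    """Hypothetical log data: two contexts, two actions, uniform logging policy."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = rng.choice([0, 1])
        action = rng.choice([0, 1])  # logging policy picks each action w.p. 0.5
        # Assumed reward model: success is more likely when action matches context.
        success_prob = 0.8 if action == context else 0.2
        reward = 1.0 if rng.random() < success_prob else 0.0
        logs.append((context, action, reward, 0.5))
    return logs


def match_policy(context, action):
    """Counterfactual policy to evaluate: always pick the action equal to the context."""
    return 1.0 if action == context else 0.0


logs = simulate_logs(20000)
estimate = ipw_value(logs, match_policy)
print(f"IPW estimate of the matching policy's value: {estimate:.3f}")
```

Under the assumed reward model, the true value of `match_policy` is 0.8, so the estimate should land near that number. The variance of this kind of estimator is what the paper's efficiency question is about.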
