Bayesian Counterfactual Risk Minimization
This work addresses offline learning from logged bandit feedback, a setting that arises in applications such as recommendation systems.
The authors address offline learning from logged bandit feedback by proposing a Bayesian view of counterfactual risk minimization, which yields a new generalization bound and a novel regularization technique. Experimental results show that this technique outperforms standard L2 regularization and is competitive with variance regularization while being simpler to implement and more computationally efficient.
We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard $L_2$ regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.
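To make the central estimator concrete, the following is a minimal sketch of a truncated inverse propensity score (IPS) estimate in its commonly used clipped form, where logged propensities below a threshold are raised to that threshold before reweighting. The function name `truncated_ips_risk`, the parameter `tau`, the loss-based formulation, and the toy data are assumptions made for illustration; the abstract does not specify the exact form used in the paper.

```python
import numpy as np


def truncated_ips_risk(losses, target_probs, logging_probs, tau=0.1):
    """Truncated IPS estimate of a target policy's risk from logged data.

    losses        -- observed loss of each logged action, shape (n,)
    target_probs  -- probability the evaluated policy gives each logged action
    logging_probs -- probability the logging policy gave each logged action
    tau           -- truncation threshold (hypothetical name); propensities
                     below tau are clipped up to tau, capping each weight
    """
    losses = np.asarray(losses, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    # Importance weights with the denominator clipped from below at tau.
    weights = target_probs / np.maximum(logging_probs, tau)
    return float(np.mean(losses * weights))


# Toy logged data (made up for illustration only).
losses = [1.0, 0.0, 1.0, 1.0]
target_probs = [0.5, 0.5, 0.2, 0.9]
logging_probs = [0.25, 0.5, 0.05, 0.9]

estimate = truncated_ips_risk(losses, target_probs, logging_probs, tau=0.1)
```

In this toy example the third sample has a logged propensity of 0.05, so truncation at 0.1 halves its weight from 4 to 2. Clipping of this kind trades some bias for lower variance, which is why a generalization bound tailored to the truncated estimator is useful.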