A Convex Framework for Confounding Robust Inference
This work targets researchers and practitioners who face overly conservative policy value estimates in offline bandits, and offers a sharper estimation method with theoretical guarantees.
The paper tackles policy evaluation in offline contextual bandits with unobserved confounders, proposing a convex programming estimator that provides a sharp lower bound on the policy value and is less conservative than existing methods.
We study policy evaluation in offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to a coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimates of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound on the policy value using convex programming. The generality of our estimator enables various extensions such as sensitivity analysis with f-divergence, model selection with cross-validation and information criteria, and robust policy learning with the sharp lower bound. Furthermore, thanks to strong duality, our estimation method can be reformulated as an empirical risk minimization problem, which enables us to provide strong theoretical guarantees for the proposed estimator using techniques from M-estimation.
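To illustrate why relaxing the uncertainty set makes the bound more conservative, here is a minimal toy sketch. It is not the estimator proposed in the paper. It assumes a hypothetical setting in which each sample's unknown inverse-propensity weight lies in a known interval, and it compares the worst-case value under those box constraints alone against the worst-case value when one extra constraint (the weights must average to a known constant) is also enforced. The function name `worst_case_value`, its parameters, and the toy numbers are all assumptions made for illustration. This special case is a small linear program that a greedy fill solves exactly, so no solver library is needed.

```python
def worst_case_value(rewards, w_lower, w_upper, mean_weight=None):
    """Minimize (1/n) * sum_i w_i * r_i subject to w_lower[i] <= w_i <= w_upper[i].

    If mean_weight is given, also enforce (1/n) * sum_i w_i == mean_weight.
    All names are hypothetical; `rewards` stands for the per-sample terms
    (reward times target-policy probability) of a weighted value estimate.
    """
    n = len(rewards)
    if mean_weight is None:
        # Box constraints only: each weight independently picks the endpoint
        # that makes its own term as small as possible.
        weights = [lo if r >= 0 else hi
                   for r, lo, hi in zip(rewards, w_lower, w_upper)]
    else:
        # Start every weight at its lower bound, then spend the remaining
        # "budget" on the samples with the smallest rewards first. Because
        # every weight has the same coefficient in the equality constraint,
        # this greedy fill is optimal for the linear program.
        weights = list(w_lower)
        budget = mean_weight * n - sum(weights)
        capacity = sum(hi - lo for lo, hi in zip(w_lower, w_upper))
        if budget < 0 or budget > capacity:
            raise ValueError("mean_weight is infeasible for the given bounds")
        for i in sorted(range(n), key=lambda j: rewards[j]):
            step = min(budget, w_upper[i] - w_lower[i])
            weights[i] += step
            budget -= step
    value = sum(w * r for w, r in zip(weights, rewards)) / n
    return value, weights


# Toy data (made up for illustration): four samples, weights in [0.5, 2.0].
rewards = [1.0, 2.0, 3.0, 4.0]
w_lower = [0.5] * 4
w_upper = [2.0] * 4

# Coarse relaxation: box constraints only.
box_value, _ = worst_case_value(rewards, w_lower, w_upper)

# Tighter uncertainty set: additionally require the weights to average to 1.
constrained_value, constrained_weights = worst_case_value(
    rewards, w_lower, w_upper, mean_weight=1.0
)

print(box_value, constrained_value)
```

On this toy data the box-only bound is 1.25, while the bound with the extra constraint is 1.875: every constraint dropped for tractability can only push the worst-case value further down. The abstract's claim is that keeping the full uncertainty set, via a convex program, avoids this loss.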