Off-Policy Evaluation with Policy-Dependent Optimization Response
This work addresses a fundamental challenge in causal inference for decision-making under operational constraints, offering a novel approach to policy evaluation and optimization, though it appears incremental by extending existing methods to handle optimization responses.
The authors tackled the problem of off-policy evaluation when decision-making involves optimization responses, such as matching or assignment, rather than simple averages, by developing a framework to address optimization bias and constructing unbiased estimators. They achieved this through a perturbation method and provided an algorithm for policy optimization, supported by numerical simulations.
The intersection of causal inference and machine learning for decision-making is rapidly expanding, but the default decision criterion remains an \textit{average} of individual causal outcomes across a population. In practice, various operational restrictions ensure that a decision-maker's utility is not realized as an \textit{average} but rather as an \textit{output} of a downstream decision-making problem (such as matching, assignment, network flow, minimizing predictive risk). In this work, we develop a new framework for off-policy evaluation with \textit{policy-dependent} linear optimization responses: causal outcomes introduce stochasticity in objective function coefficients. Under this framework, a decision-maker's utility depends on the policy-dependent optimization, which introduces a fundamental challenge of \textit{optimization} bias even for the case of policy evaluation. We construct unbiased estimators for the policy-dependent estimand by a perturbation method, and discuss asymptotic variance properties for a set of adjusted plug-in estimators. Lastly, attaining unbiased policy evaluation allows for policy optimization: we provide a general algorithm for optimizing causal interventions. We corroborate our theoretical results with numerical simulations.