Predicting Long Term Sequential Policy Value Using Softer Surrogates
The work addresses a key challenge in domains like healthcare, where novel treatments require costly long-term trials. The contribution is incremental, however, since it builds on existing surrogacy conditions and OPE frameworks.
The paper tackles the problem of predicting long-term policy outcomes when a new policy introduces novel actions, a setting that existing off-policy evaluation methods cannot handle. In simulated healthcare examples for HIV and sepsis management, the authors' estimators accurately predict policy value after observing only 10% of the full-horizon data.
Off-policy policy evaluation (OPE) estimates the outcome of a new policy using historical data collected from a different policy. However, existing OPE methods cannot handle cases where the new policy introduces novel actions. This issue commonly occurs in real-world domains, like healthcare, as new drugs and treatments are continuously developed. Novel actions necessitate on-policy data collection, which can be burdensome and expensive if the outcome of interest takes a substantial amount of time to observe, for example, in multi-year clinical trials. This raises a key question: how can we predict the long-term outcome of a policy after observing only its short-term effects? Though this problem is intractable in general, under some surrogacy conditions the short-term on-policy data can be combined with the long-term historical data to make accurate predictions about the new policy's long-term value. In two simulated healthcare examples, HIV and sepsis management, we show that our estimators can provide accurate predictions about the policy value after observing only 10% of the full-horizon data. We also provide finite-sample analysis of our doubly robust estimators.
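To make the surrogacy idea concrete, here is a minimal sketch of a plain regression-based surrogate estimator. It is not the paper's estimator: the abstract describes doubly robust, sequential estimators, and this is a simpler stand-in. The sketch assumes a single scalar short-term surrogate and assumes that the relationship between surrogate and long-term outcome is the same under the historical and the new policy. The data-generating process, the linear model, and every name below (`simulate`, `fit_line`, `surrogate_mean`) are hypothetical and chosen only for illustration.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(rng, n, surrogate_mean):
    """Hypothetical data: the long-term outcome depends on the policy
    only through the short-term surrogate (the surrogacy assumption)."""
    s = [rng.gauss(surrogate_mean, 1.0) for _ in range(n)]
    y = [1.0 + 2.0 * si + rng.gauss(0.0, 0.1) for si in s]
    return s, y


rng = random.Random(0)

# Historical policy: both the surrogate and the long-term outcome are observed.
s_hist, y_hist = simulate(rng, 5000, surrogate_mean=0.0)

# New policy: only the short-term surrogate would be observed in practice.
# y_new is generated here solely to check the estimate.
s_new, y_new = simulate(rng, 5000, surrogate_mean=1.0)

# Step 1: learn the surrogate-to-outcome map from historical data.
intercept, slope = fit_line(s_hist, y_hist)

# Step 2: apply it to the new policy's short-term data and average.
estimate = fmean([intercept + slope * s for s in s_new])
true_value = fmean(y_new)

print(f"estimated long-term value: {estimate:.3f}")
print(f"held-out long-term value:  {true_value:.3f}")
```

The estimate is only as good as the surrogacy assumption. If the new policy changed the long-term outcome through a channel the surrogate does not capture, the map learned on historical data would no longer transfer.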