Meta Off-Policy Estimation
Practitioners and researchers in recommender systems must choose among many OPE estimators; this work addresses that challenge by providing a method to combine them. The contribution is incremental in that it builds on existing OPE frameworks.
The paper combines multiple off-policy estimation (OPE) methods into a single, more accurate estimate using a correlated fixed-effects meta-analysis framework, and reports improved statistical efficiency on simulated and real-world data.
Off-policy estimation (OPE) methods enable unbiased offline evaluation of recommender systems, directly estimating the online reward some target policy would have obtained, from offline data and with statistical guarantees. The theoretical elegance of the framework, combined with practical successes, has led to a surge of interest, with many competing estimators now available to practitioners and researchers. Among these, Doubly Robust methods provide a prominent strategy to combine value- and policy-based estimators. In this work, we take an alternative perspective to combine a set of OPE estimators and their associated confidence intervals into a single, more accurate estimate. Our approach leverages a correlated fixed-effects meta-analysis framework, explicitly accounting for dependencies among estimators that arise due to shared data. This yields a best linear unbiased estimate (BLUE) of the target policy's value, along with an appropriately conservative confidence interval that reflects inter-estimator correlation. We validate our method on both simulated and real-world data, demonstrating improved statistical efficiency over existing individual estimators.
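
To make the combination step concrete, the following is a minimal sketch of the textbook BLUE for a common mean under correlated estimates, not the paper's implementation. It assumes each estimator is unbiased for the same policy value and that the covariance matrix of the estimators is already known or has been estimated elsewhere; the abstract does not say how the paper obtains it. The function name `combine_estimates`, its parameters, and the normal-approximation interval are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist


def combine_estimates(estimates, cov, alpha=0.05):
    """Combine correlated, unbiased estimates of one quantity (hypothetical helper).

    estimates: length-k sequence of point estimates from k OPE estimators.
    cov:       k x k covariance matrix of those estimators (assumed given).
    Returns (combined estimate, its variance, (lower, upper) interval, weights).
    """
    theta = np.asarray(estimates, dtype=float)
    sigma = np.asarray(cov, dtype=float)
    ones = np.ones_like(theta)

    # Solve Sigma x = 1 rather than forming the inverse explicitly.
    sigma_inv_ones = np.linalg.solve(sigma, ones)
    precision = float(ones @ sigma_inv_ones)  # 1' Sigma^{-1} 1

    # Generalized least squares weights; they sum to one by construction.
    weights = sigma_inv_ones / precision
    combined = float(weights @ theta)
    variance = 1.0 / precision

    # Normal-approximation interval (an assumption of this sketch).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * variance ** 0.5
    return combined, variance, (combined - half_width, combined + half_width), weights


# Two estimators with standard errors 0.1 and 0.2 and correlation 0.3.
cov = [[0.01, 0.006], [0.006, 0.04]]
estimate, variance, interval, weights = combine_estimates([1.0, 1.2], cov)
print(estimate, variance, interval, weights)
```

With a diagonal covariance matrix this reduces to ordinary inverse-variance weighting. The off-diagonal terms are what keep the combined variance from being understated when the estimators share data.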