Adaptive Estimator Selection for Off-Policy Evaluation
This work addresses estimator selection in off-policy evaluation, a problem relevant to machine learning researchers and practitioners, though the contribution appears incremental because it builds on existing methods.
The paper develops a data-driven method for selecting among estimators in off-policy evaluation. The method is guaranteed to be competitive with the oracle estimator up to a constant factor, and it compares favorably with existing methods in contextual bandit and reinforcement learning case studies.
We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings. We establish a strong performance guarantee for the method, showing that it is competitive with the oracle estimator, up to a constant factor. Via in-depth case studies in contextual bandits and reinforcement learning, we demonstrate the generality and applicability of the method. We also perform comprehensive experiments, demonstrating the empirical efficacy of our approach and comparing with related approaches. In both case studies, our method compares favorably with existing methods.
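The abstract does not spell out the selection rule. As a minimal sketch of how a data-driven selector of this kind can work, the code below assumes a Lepski-style interval-intersection rule, which is one standard way to obtain oracle-type guarantees. It is not confirmed by the text above as the authors' procedure. The function name `select_estimator` and its inputs are hypothetical: `estimates` holds the value estimate from each candidate estimator, and `widths` holds a high-probability deviation bound for each. Candidates are assumed to be ordered so that the deviation bound shrinks while the (unknown) bias may grow.

```python
from typing import Sequence


def select_estimator(estimates: Sequence[float], widths: Sequence[float]) -> int:
    """Return the index of the selected candidate estimator.

    Assumption (not from the source): candidates are ordered so that
    widths[i] is non-increasing in i and bias is non-decreasing in i.
    The rule keeps moving to lower-variance candidates for as long as
    the intervals [estimate - width, estimate + width] share a common point.
    """
    if len(estimates) != len(widths) or len(estimates) == 0:
        raise ValueError("estimates and widths must be non-empty and equal length")

    lo, hi = float("-inf"), float("inf")  # running intersection of intervals
    chosen = 0
    for i, (est, width) in enumerate(zip(estimates, widths)):
        new_lo = max(lo, est - width)
        new_hi = min(hi, est + width)
        if new_lo > new_hi:
            # Interval i misses the common intersection: its bias is
            # likely dominating, so stop at the previous candidate.
            break
        lo, hi = new_lo, new_hi
        chosen = i
    return chosen


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the rule.
    idx = select_estimator([1.0, 1.1, 1.05, 2.0], [1.0, 0.5, 0.2, 0.05])
    print("selected candidate index:", idx)
```

In this illustrative call the first three intervals overlap and the fourth does not, so the rule stops at the third candidate. The appeal of such a rule is that it needs only the estimates and their deviation bounds, not the unknown bias of each candidate.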