LG ME ML Oct 21, 2020

Optimal Off-Policy Evaluation from Multiple Logging Policies

Nathan Kallus, Yuta Saito, Masatoshi Uehara

arXiv:2010.11002v1 15.645 citations

Originality: Incremental advance

AI Analysis

This work addresses optimal variance reduction in off-policy evaluation when the data come from multiple sources. The advance is incremental: it builds on prior methods to improve efficiency in this specific setting.

The paper tackles off-policy evaluation from multiple logging policies under stratified sampling. It resolves the dilemma of which importance sampling weights to use by deriving the efficient estimator, i.e., the one with minimum variance for any instance. It establishes an efficiency bound, proposes an estimator that achieves this bound when given consistent q-estimates, and reports extensive experiments in support.

We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate the benefits of our methods, which efficiently leverage the stratified sampling of off-policy data from multiple loggers.
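To make the choice of importance sampling weights concrete, the following is a minimal sketch on a context-free bandit with two loggers. It is not the paper's implementation: the function names, the toy policies, and the reward model are all assumptions made for illustration. It contrasts weights taken against each sample's own logger with weights taken against the sample-size-weighted mixture of loggers, and adds a doubly robust variant that uses a q-estimate (`q_hat`, hypothetical here) as a control variate.

```python
import random


def sample_action(probs, rng):
    """Draw an action index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def make_stratified_logs(loggers, sizes, mean_reward, rng):
    """Collect exactly sizes[k] samples from logger k (stratified sampling).

    Returns a list of (logger_index, action, reward) tuples.
    Rewards are the action's mean plus unit Gaussian noise (toy assumption).
    """
    logs = []
    for k, (pi_k, n_k) in enumerate(zip(loggers, sizes)):
        for _ in range(n_k):
            a = sample_action(pi_k, rng)
            r = mean_reward[a] + rng.gauss(0.0, 1.0)
            logs.append((k, a, r))
    return logs


def mixture_policy(loggers, sizes):
    """Sample-size-weighted mixture of the logging policies."""
    n = sum(sizes)
    n_actions = len(loggers[0])
    return [
        sum(n_k / n * pi_k[a] for pi_k, n_k in zip(loggers, sizes))
        for a in range(n_actions)
    ]


def naive_is(logs, loggers, pi_e):
    """Importance sampling with weights against each sample's own logger."""
    return sum(pi_e[a] / loggers[k][a] * r for k, a, r in logs) / len(logs)


def balanced_is(logs, loggers, sizes, pi_e):
    """Importance sampling with weights against the mixture policy."""
    pi_mix = mixture_policy(loggers, sizes)
    return sum(pi_e[a] / pi_mix[a] * r for _, a, r in logs) / len(logs)


def doubly_robust(logs, loggers, sizes, pi_e, q_hat):
    """Mixture-weighted estimator with q_hat as a control variate."""
    pi_mix = mixture_policy(loggers, sizes)
    baseline = sum(p * q for p, q in zip(pi_e, q_hat))
    correction = sum(
        pi_e[a] / pi_mix[a] * (r - q_hat[a]) for _, a, r in logs
    ) / len(logs)
    return baseline + correction


if __name__ == "__main__":
    rng = random.Random(0)
    loggers = [[0.8, 0.2], [0.2, 0.8]]  # two toy logging policies
    sizes = [100, 300]                  # fixed dataset size per logger
    pi_e = [0.5, 0.5]                   # evaluation policy
    mean_reward = [1.0, 2.0]            # true value is 1.5
    logs = make_stratified_logs(loggers, sizes, mean_reward, rng)
    print("naive IS   ", naive_is(logs, loggers, pi_e))
    print("balanced IS", balanced_is(logs, loggers, sizes, pi_e))
    print("DR         ", doubly_robust(logs, loggers, sizes, pi_e, mean_reward))
```

All three estimators are unbiased for the evaluation policy's value under this stratified design; they differ only in variance, and which of the two plain importance sampling variants does better depends on the instance. Passing the true mean rewards as `q_hat` in the demo is an idealization; in practice the q-estimate would be fit from data.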
