ML AI IR LG | Aug 7, 2023

Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces

arXiv:2308.03443v34.31 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurate policy evaluation in large-scale recommendation or decision-making systems, representing an incremental improvement over existing estimators.

The paper addresses the bias-variance tradeoff in Off-Policy Evaluation (OPE) for contextual bandits with large action spaces. It proposes a Marginalized Doubly Robust (MDR) estimator that, according to the authors' analysis, is unbiased under weaker assumptions than Marginalized Inverse Propensity Scoring (MIPS) and has lower variance than MIPS. The paper's experiments support these claims.

We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. The benchmark estimators suffer from a severe bias-variance tradeoff: parametric approaches are biased because the correct model is difficult to specify, whereas importance-weighting approaches suffer from high variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to reduce the estimator's variance by using action embeddings. However, MIPS relies on the no direct effect assumption for unbiasedness, which requires that the action embedding completely mediates the effect of an action on the reward. To remove the dependence on this unrealistic assumption, we propose a Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while also reducing variance relative to MIPS. Experiments verify the superiority of MDR over existing estimators in large action spaces.
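
To make the variance argument concrete, here is a minimal sketch of the two importance-weighting baselines the abstract contrasts: vanilla IPS, which weights each logged reward by pi(a|x) / pi_0(a|x), and MIPS, which weights by the ratio of marginal embedding probabilities p(e|x, pi) / p(e|x, pi_0). It does not implement the proposed MDR estimator, whose exact form is not given here. The function names, the toy policies, and the deterministic action-to-embedding map are all assumptions made for illustration.

```python
from collections import defaultdict


def ips_estimate(logs, target, logging):
    """Vanilla IPS: average of pi(a|x) / pi_0(a|x) * r over logged rounds."""
    total = 0.0
    for x, a, r in logs:
        total += target[x][a] / logging[x][a] * r
    return total / len(logs)


def marginal_embedding_probs(policy_x, embedding_of):
    """p(e | x, pi): sum pi(a|x) over actions mapped to embedding e.

    Assumes a deterministic embedding per action (an illustrative simplification).
    """
    probs = defaultdict(float)
    for a, p in policy_x.items():
        probs[embedding_of[a]] += p
    return probs


def mips_estimate(logs, target, logging, embedding_of):
    """MIPS: average of p(e|x, pi) / p(e|x, pi_0) * r over logged rounds."""
    total = 0.0
    for x, a, r in logs:
        e = embedding_of[a]
        p_target = marginal_embedding_probs(target[x], embedding_of)[e]
        p_logging = marginal_embedding_probs(logging[x], embedding_of)[e]
        total += p_target / p_logging * r
    return total / len(logs)


# Hypothetical toy problem: one context, four actions, two embedding clusters.
embedding_of = {"a0": "e0", "a1": "e0", "a2": "e1", "a3": "e1"}
logging = {"x0": {"a0": 0.25, "a1": 0.25, "a2": 0.25, "a3": 0.25}}
target = {"x0": {"a0": 0.7, "a1": 0.1, "a2": 0.1, "a3": 0.1}}

# The reward depends only on the embedding, so "no direct effect" holds here.
logs = [("x0", "a0", 1.0), ("x0", "a1", 1.0), ("x0", "a2", 0.0), ("x0", "a3", 0.0)]

ips_value = ips_estimate(logs, target, logging)
mips_value = mips_estimate(logs, target, logging, embedding_of)

# Largest importance weight each estimator can produce on this problem.
max_ips_weight = max(target["x0"][a] / logging["x0"][a] for a in embedding_of)
max_mips_weight = max(
    marginal_embedding_probs(target["x0"], embedding_of)[e]
    / marginal_embedding_probs(logging["x0"], embedding_of)[e]
    for e in set(embedding_of.values())
)
```

On this toy log both estimates come out at approximately 0.8, the true value of the target policy, but the largest MIPS weight (about 1.6) is smaller than the largest IPS weight (about 2.8). That gap grows with the number of actions per embedding, which is the intuition behind the variance reduction. If the reward also depended on the action directly, the MIPS weights would no longer guarantee unbiasedness, which is the limitation MDR is meant to address.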
