LG · ML · Oct 28, 2019

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

arXiv:1910.12809v4 · 201 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating policies in reinforcement learning without knowledge of the behavior policy, offering incremental improvements over prior methods.

The paper tackles off-policy evaluation in reinforcement learning by introducing two new estimators, MWL and MQL, which estimate importance ratios and value functions, respectively, without requiring knowledge of the behavior policy. Its results include sample complexity analyses and asymptotic optimality in the tabular setting.

We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018). (2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner. (3) Several additional results that offer further insights into these methods, including sample complexity analyses of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.
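To make the roles of the importance weights and the Q-function concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. It works with exact expectations on a small random MDP rather than with sampled transitions, and it assumes the discriminator class is the set of all state-action indicator functions, so the minimax problems reduce to linear systems. All names (`occupancy`, `mwl_tabular`, `mql_tabular`, `dr_value`) are hypothetical. The behavior policy is used only to generate the data distribution `d_b`; the solvers themselves never see it.

```python
import numpy as np

def occupancy(P, d0, pi, gamma):
    """Normalized discounted state-action occupancy of policy pi (length S*A)."""
    S, A = pi.shape
    # P_pi[(s,a),(s',a')] = P[s,a,s'] * pi[s',a']
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    mu = (d0[:, None] * pi).reshape(S * A)  # initial state-action distribution
    d = (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * P_pi.T, mu)
    return d, P_pi, mu

def mwl_tabular(d_b, P_pi_e, mu_e, gamma):
    """Weights w that zero the MWL-style loss for every indicator discriminator.

    Solves (D - gamma * P_pi_e^T D) w = (1 - gamma) * mu_e with D = diag(d_b).
    Only the data distribution d_b enters, not the behavior policy itself.
    """
    D = np.diag(d_b)
    return np.linalg.solve(D - gamma * P_pi_e.T @ D, (1 - gamma) * mu_e)

def mql_tabular(r, P_pi_e, gamma):
    """Q-function with zero average Bellman error against every indicator.

    With full-support data, d_b cancels and this is the Bellman equation
    q = r + gamma * P_pi_e q.
    """
    return np.linalg.solve(np.eye(len(r)) - gamma * P_pi_e, r)

# Hypothetical toy problem: random MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'], shape (S, A, S)
r = rng.uniform(size=S * A)                  # deterministic reward r(s, a)
d0 = np.full(S, 1.0 / S)                     # initial state distribution
pi_b = rng.dirichlet(np.ones(A), size=S)     # behavior policy, generates data only
pi_e = rng.dirichlet(np.ones(A), size=S)     # target policy to evaluate

d_b, _, _ = occupancy(P, d0, pi_b, gamma)
d_e, P_e, mu_e = occupancy(P, d0, pi_e, gamma)

w = mwl_tabular(d_b, P_e, mu_e, gamma)
q = mql_tabular(r, P_e, gamma)

value_true = d_e @ r                         # normalized discounted return of pi_e
value_mwl = d_b @ (w * r)                    # reweighted average reward
value_mql = (1 - gamma) * mu_e @ q           # average initial Q-value

def dr_value(w_hat, q_hat):
    """Doubly robust combination: Q-based estimate plus weighted Bellman residual."""
    residual = r + gamma * P_e @ q_hat - q_hat
    return (1 - gamma) * mu_e @ q_hat + d_b @ (w_hat * residual)

print(value_true, value_mwl, value_mql)
```

In this exact-expectation setting the recovered `w` equals the occupancy ratio `d_e / d_b`, and both estimates match the true value. The doubly robust form also stays correct when only one component is right, for example `dr_value(np.ones_like(w), q)` or `dr_value(w, np.zeros_like(q))`. With finite samples and restricted function classes, which is the setting the abstract analyzes, the expectations would be replaced by sample averages and these equalities would hold only approximately.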

