Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

arXiv:2605.06474


AI Analysis

Provides a novel theoretical framework for off-policy evaluation in RL with improved guarantees, addressing coverage and function approximation challenges.

Q-MMR introduces a recursive reweighting method for off-policy evaluation that achieves a dimension-free finite-sample error bound under only realizability of $Q^\pi$, without depending on the statistical complexity of the function class.

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
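To make the phrase "reweighted rewards approximate the expected return" concrete, here is a minimal sketch of the estimator form using per-decision importance sampling, one of the existing methods the abstract relates Q-MMR to. This is not the Q-MMR procedure: the abstract does not state the moment matching objective, so the weights below are plain cumulative importance ratios rather than learned ones. The function names, the dictionary representation of policies, and the toy data are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Step = Tuple[int, int, float]            # (state, action, reward)
Policy = Dict[Tuple[int, int], float]    # (state, action) -> probability


def per_decision_is_weights(
    trajectories: List[List[Step]], pi: Policy, mu: Policy
) -> List[List[float]]:
    """Return one scalar weight per data point (trajectory i, step t).

    Assumption: weights are cumulative importance ratios divided by the
    number of trajectories. Q-MMR would learn its weights instead.
    """
    n = len(trajectories)
    weights = []
    for traj in trajectories:
        rho = 1.0  # running product of pi/mu ratios along this trajectory
        row = []
        for state, action, _reward in traj:
            rho *= pi[(state, action)] / mu[(state, action)]
            row.append(rho / n)
        weights.append(row)
    return weights


def reweighted_return(
    trajectories: List[List[Step]], weights: List[List[float]]
) -> float:
    """Estimate the target policy's return as sum_i w_i * r_i."""
    return sum(
        w * reward
        for traj, row in zip(trajectories, weights)
        for (_state, _action, reward), w in zip(traj, row)
    )


# Toy one-step problem: a single state 0 and two actions.
# Action 0 pays reward 1, action 1 pays reward 0.
mu = {(0, 0): 0.5, (0, 1): 0.5}   # behavior policy that generated the data
pi = {(0, 0): 0.8, (0, 1): 0.2}   # target policy to evaluate
data = [[(0, 0, 1.0)], [(0, 1, 0.0)]]

weights = per_decision_is_weights(data, pi, mu)
estimate = reweighted_return(data, weights)
print(estimate)  # approximately 0.8, the true return 0.8 * 1 + 0.2 * 0
```

The point of the sketch is only the shape of the estimator: every data point carries a single scalar weight, and the value estimate is a weighted sum of observed rewards. Importance sampling needs the behavior policy's action probabilities to form those weights, whereas the abstract describes Q-MMR as obtaining them by moment matching against a value-function discriminator class.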
