LG ST ML Feb 21, 2020

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

arXiv:2002.09516v1 30.2164 citations

Originality: Highly original

AI Analysis

This work characterizes the statistical limits of evaluating new policies from logged data, which matters for safe and efficient reinforcement learning applications, though its contribution is an incremental refinement of theoretical understanding.

The paper tackles the off-policy evaluation problem in batch reinforcement learning by analyzing a regression-based fitted Q iteration method. It proves that the method is information-theoretically optimal with nearly minimal estimation error, supported by a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound.

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
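The equivalence between regression-based fitted Q iteration and a model-based estimate can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a finite-horizon, undiscounted toy MDP with 3 states and 2 actions, one-hot features, a uniform behavior policy, a small ridge term for numerical stability, and the hypothetical helper names `fqe_linear` and `model_based_linear`. With a linear feature map, each regression step is a linear function of the previous weights, so iterating the regression should match iterating a fitted reward vector and a fitted feature-transition matrix.

```python
import numpy as np

def fqe_linear(phi, rewards, phi_next_pi, horizon, ridge=1e-6):
    """Fitted Q iteration for policy evaluation with linear features.

    phi:         (N, d) features of logged (state, action) pairs
    rewards:     (N,)   logged rewards
    phi_next_pi: (N, d) features at the next state, averaged over the
                 target policy's action distribution
    """
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)  # regularized empirical covariance
    w = np.zeros(d)                        # Q is zero past the horizon
    for _ in range(horizon):
        targets = rewards + phi_next_pi @ w        # regression targets
        w = np.linalg.solve(cov, phi.T @ targets)  # ridge least squares
    return w

def model_based_linear(phi, rewards, phi_next_pi, horizon, ridge=1e-6):
    """Same estimate via a fitted reward vector and transition matrix."""
    d = phi.shape[1]
    cov = phi.T @ phi + ridge * np.eye(d)
    r_hat = np.linalg.solve(cov, phi.T @ rewards)      # reward model
    m_hat = np.linalg.solve(cov, phi.T @ phi_next_pi)  # feature-transition model
    w = np.zeros(d)
    for _ in range(horizon):
        w = r_hat + m_hat @ w  # Bellman backup under the fitted model
    return w

# Toy logged data (all quantities below are synthetic assumptions).
rng = np.random.default_rng(0)
n_states, n_actions, n_samples, horizon = 3, 2, 500, 5
d = n_states * n_actions

def one_hot(s, a):
    v = np.zeros(d)
    v[s * n_actions + a] = 1.0
    return v

def policy_features(s, pi):
    # Expected feature vector at state s under policy pi.
    return sum(pi[s, b] * one_hot(s, b) for b in range(n_actions))

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
R = rng.uniform(size=(n_states, n_actions))                       # mean rewards
target_pi = rng.dirichlet(np.ones(n_actions), size=n_states)      # target policy

s = rng.integers(n_states, size=n_samples)
a = rng.integers(n_actions, size=n_samples)  # uniform behavior policy
s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])
rewards = R[s, a] + 0.1 * rng.standard_normal(n_samples)

phi = np.stack([one_hot(si, ai) for si, ai in zip(s, a)])
phi_next_pi = np.stack([policy_features(sn, target_pi) for sn in s_next])

w_fqe = fqe_linear(phi, rewards, phi_next_pi, horizon)
w_mb = model_based_linear(phi, rewards, phi_next_pi, horizon)

# Value estimate under a uniform initial state distribution (an assumption).
phi_init = np.mean([policy_features(s0, target_pi) for s0 in range(n_states)], axis=0)
value_fqe = float(phi_init @ w_fqe)
value_mb = float(phi_init @ w_mb)
print("weights agree:", np.allclose(w_fqe, w_mb))
```

In this sketch the two routines agree up to floating-point error because the regression target is affine in the previous weight vector. The sketch does not implement the paper's error bounds or its confidence bound.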
