LG · AI · SY · ML · Oct 29, 2018

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

arXiv:1810.12429v1
391 citations
Originality: Incremental advance
AI Analysis

The paper addresses a critical bottleneck in reinforcement learning for long-term decision-making. It offers a solution to a well-known issue, though the methodological innovation is incremental.

The paper tackles the high-variance problem in off-policy estimation for infinite-horizon tasks. It proposes a method that applies importance sampling to the stationary state-visitation distributions, which avoids the unbounded variance that existing IS-based estimators can suffer, and it reports empirical improvements.

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.
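To make the idea of weighting by stationary state-visitation distributions concrete, here is a minimal sketch in Python with NumPy. It is not the paper's estimator. It assumes a hypothetical two-state, two-action MDP whose transition tensor `T`, reward table `R`, and policies `pi_b` (behavior) and `pi_t` (target) are all invented for illustration. Because `T` is known here, the density ratio `w` is computed exactly from the two stationary distributions; the paper's contribution is estimating that ratio from behavior data alone via a mini-max loss, which this sketch does not attempt. The self-normalized form of the weighted average is also a choice made for this sketch.

```python
import numpy as np

def state_transition_matrix(T, pi):
    """P[s, s'] = sum_a pi[s, a] * T[s, a, s'] for a tabular policy pi."""
    return np.einsum("sa,sap->sp", pi, T)

def stationary_distribution(P):
    """Left eigenvector of row-stochastic P for eigenvalue 1, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def stationary_is_estimate(states, actions, rewards, w, pi_target, pi_behavior):
    """Self-normalized average of rewards, weighted per step by
    w(s) * pi_target(a|s) / pi_behavior(a|s) instead of a product over the trajectory."""
    weights = w[states] * pi_target[states, actions] / pi_behavior[states, actions]
    return np.sum(weights * rewards) / np.sum(weights)

# Hypothetical toy MDP (all numbers are illustrative assumptions).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[s, a, s']
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [0.5, 0.2]])     # R[s, a]
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])  # behavior policy pi_b[s, a]
pi_t = np.array([[0.2, 0.8], [0.3, 0.7]])  # target policy pi_t[s, a]

# Exact stationary distributions and their ratio (possible only because T is known).
d_b = stationary_distribution(state_transition_matrix(T, pi_b))
d_t = stationary_distribution(state_transition_matrix(T, pi_t))
w = d_t / d_b

# Simulate one long trajectory under the behavior policy only.
rng = np.random.default_rng(0)
n_steps = 50_000
states = np.empty(n_steps, dtype=int)
actions = np.empty(n_steps, dtype=int)
s = 0
for t in range(n_steps):
    a = int(rng.random() < pi_b[s, 1])   # action 1 with probability pi_b[s, 1]
    states[t], actions[t] = s, a
    s = int(rng.random() < T[s, a, 1])   # next state 1 with probability T[s, a, 1]
rewards = R[states, actions]

estimate = stationary_is_estimate(states, actions, rewards, w, pi_t, pi_b)
true_value = float(np.sum(d_t[:, None] * pi_t * R))  # average reward under the target policy
print(f"estimate={estimate:.4f}  true={true_value:.4f}")
```

Each per-step weight here is a bounded product of two ratios, so it does not grow with the trajectory length, in contrast to a trajectory-level importance weight that multiplies one policy ratio per step.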

Code Implementations: 2 repos
