LG, AI | Jun 17, 2020

A maximum-entropy approach to off-policy evaluation in average-reward MDPs

Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Gorur, Chris Harris, Dale Schuurmans

arXiv:2006.12620v19.013 citations

Originality: Incremental advance

AI Analysis

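This work addresses off-policy evaluation, the problem of assessing a policy without interacting with the environment directly, which is relevant to researchers and practitioners in reinforcement learning. Its contribution is incremental: it extends existing finite-sample results beyond the episodic and discounted cases to the infinite-horizon undiscounted setting.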

The paper tackles off-policy evaluation in infinite-horizon undiscounted Markov decision processes. It provides the first finite-sample error bound for ergodic linear MDPs and proposes a maximum-entropy method for estimating stationary distributions with function approximation. The authors report that both approaches are effective in multiple environments.

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
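
The maximum-entropy formulation described in the abstract can be illustrated with a short derivation. This is a generic sketch rather than the paper's exact statement: the symbols $x = (s, a)$, $\phi$, $\hat{M}$, $\theta$, $\lambda$, and $Z$ are notation assumed here, and the sketch assumes a finite state-action space and exactly linear empirical feature dynamics, $\hat{\mathbb{E}}[\phi(x') \mid x] = \hat{M}^\top \phi(x)$.

```latex
% Assumed notation (not from the source): x = (s, a), features \phi(x) \in \mathbb{R}^k,
% and a k x k matrix \hat{M} with \hat{\mathbb{E}}[\phi(x') \mid x] = \hat{M}^\top \phi(x).

% Primal problem: maximize entropy subject to matching feature expectations
% before and after one step of the empirical dynamics.
\begin{align*}
\max_{d \ge 0} \; & -\sum_x d(x) \log d(x) \\
\text{s.t.} \; & \sum_x d(x)\,\bigl(I - \hat{M}^\top\bigr)\phi(x) = 0,
\qquad \sum_x d(x) = 1.
\end{align*}

% Lagrangian with multipliers \theta \in \mathbb{R}^k and \lambda \in \mathbb{R}.
\begin{align*}
\mathcal{L}(d, \theta, \lambda)
  &= -\sum_x d(x) \log d(x)
     + \theta^\top \sum_x d(x)\,\bigl(I - \hat{M}^\top\bigr)\phi(x)
     + \lambda \Bigl(\sum_x d(x) - 1\Bigr), \\
\frac{\partial \mathcal{L}}{\partial d(x)}
  &= -\log d(x) - 1 + \theta^\top \bigl(I - \hat{M}^\top\bigr)\phi(x) + \lambda = 0.
\end{align*}

% Solving for d gives an exponential-family form in the features \phi.
\begin{align*}
d_\theta(x)
  &= \frac{\exp\!\bigl(\bigl((I - \hat{M})\,\theta\bigr)^\top \phi(x)\bigr)}{Z(\theta)},
\qquad
Z(\theta) = \sum_x \exp\!\bigl(\bigl((I - \hat{M})\,\theta\bigr)^\top \phi(x)\bigr), \\
\theta^\star
  &\in \arg\min_\theta \, \log Z(\theta),
\qquad
\nabla_\theta \log Z(\theta)
  = \sum_x d_\theta(x)\,\bigl(I - \hat{M}^\top\bigr)\phi(x).
\end{align*}
```

Under these assumptions, setting the gradient of $\log Z(\theta)$ to zero recovers the feature-matching constraint, and the solution depends on $x$ only through $\phi(x)$. This is consistent with the abstract's description of an exponential-family distribution whose sufficient statistics are the features.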
