LG, ML | Oct 3, 2025

Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Finite-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan

arXiv:2510.03494v1 | 4.1 | h-index: 2

Originality: Incremental advance

AI Analysis

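The paper addresses a theoretical gap in offline RL: whether policy evaluation can be statistically efficient under linear $q^π$-realizability and concentrability. It gives a learner that is efficient when the data consists of trajectories. The advance is incremental, since it extends the policy optimization result of Tkachuk et al. (2024) to policy evaluation.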

The paper studies policy evaluation in finite-horizon offline reinforcement learning. It shows that statistically efficient learning is possible when the data is given as trajectories, under linear $q^π$-realizability and concentrability. It also shows, through a tighter analysis, that the learner of Tkachuk et al. (2024) for policy optimization has a better sample complexity than previously established.

We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.
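
For orientation, the two assumptions named in the abstract are commonly formalized as below. This is a sketch in standard notation, not necessarily the paper's exact definitions: the feature map $\varphi_h$, dimension $d$, horizon $H$, data distribution $\mu_h$, occupancy measure $d_h^\pi$, and constant $C$ are symbols assumed here for illustration.

$$
\text{(linear } q^\pi\text{-realizability)} \qquad \forall \pi,\ \forall h \in \{1,\dots,H\},\ \exists\, \theta_h^\pi \in \mathbb{R}^d :\quad q_h^\pi(s,a) = \langle \varphi_h(s,a),\, \theta_h^\pi \rangle \quad \forall (s,a)
$$

$$
\text{(concentrability)} \qquad \forall \pi,\ \forall h \in \{1,\dots,H\} :\quad \sup_{(s,a)} \frac{d_h^\pi(s,a)}{\mu_h(s,a)} \le C
$$

Here $d_h^\pi$ denotes the state-action distribution induced by policy $\pi$ at stage $h$, and $\mu_h$ the distribution from which the offline data at stage $h$ is drawn. Statistical efficiency then means a sample complexity polynomial in quantities such as $d$, $H$, $C$, and the inverse accuracy.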

