Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation
This provides a foundational advance for reinforcement learning by eliminating polynomial dependency on planning horizons, benefiting researchers and practitioners in AI and control systems.
The paper tackles long planning horizon problems in reinforcement learning with general function approximation by proposing the UCRL-WVTR algorithm, which achieves a horizon-free and instance-dependent regret bound that matches the minimax lower bound for linear mixture MDPs up to logarithmic factors.
To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.