LGMLDec 7, 2023

Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

arXiv:2312.04464v14 citationsh-index: 13AISTATS
Originality Highly original
AI Analysis

This provides a foundational advance for reinforcement learning by eliminating polynomial dependency on planning horizons, benefiting researchers and practitioners in AI and control systems.

The paper tackles long planning horizon problems in reinforcement learning with general function approximation by proposing the UCRL-WVTR algorithm, which achieves a horizon-free and instance-dependent regret bound that matches the minimax lower bound for linear mixture MDPs up to logarithmic factors.

To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes