Rate-matching the regret lower-bound in the linear quadratic regulator with unknown dynamics
This work addresses a fundamental mismatch in reinforcement learning theory for a key setting, providing tight bounds that enhance understanding of sample efficiency, safety, and robustness.
The paper tackles the gap between regret lower- and upper-bounds in the linear quadratic regulator with unknown dynamics, achieving a novel regret upper-bound of O_p(√T) that matches the known lower-bound, and simultaneously establishes an estimation error bound of O_p(T^{-1/4}) that also matches a lower-bound rate.
The theory of reinforcement learning currently suffers from a mismatch between its empirical performance and the theoretical characterization of its performance, with consequences for, e.g., the understanding of sample efficiency, safety, and robustness. The linear quadratic regulator with unknown dynamics is a fundamental reinforcement learning setting with significant structure in its dynamics and cost function, yet even in this setting there is a gap between the best known regret lower-bound of $Ω_p(\sqrt{T})$ and the best known upper-bound of $O_p(\sqrt{T}\,\text{polylog}(T))$. The contribution of this paper is to close that gap by establishing a novel regret upper-bound of $O_p(\sqrt{T})$. Our proof is constructive in that it analyzes the regret of a concrete algorithm, and simultaneously establishes an estimation error bound on the dynamics of $O_p(T^{-1/4})$ which is also the first to match the rate of a known lower-bound. The two keys to our improved proof technique are (1) a more precise upper- and lower-bound on the system Gram matrix and (2) a self-bounding argument for the expected estimation error of the optimal controller.