LG | ML | Feb 25, 2021

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret

arXiv:2102.12608v1 | 20 citations
Originality: Highly original
AI Analysis

This work is relevant to practitioners who apply reinforcement learning to control: it removes the need for costly model-based approaches, though its contribution is an incremental improvement to the regret guarantees available for model-free methods.

The paper tackles the problem of learning to control a linear dynamical system with quadratic costs (LQR) without relying on system identification, and presents the first model-free algorithm that achieves regret scaling optimally with the time horizon T, specifically √T regret.
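For readers unfamiliar with the setting, the problem described above can be written out in the usual LQR notation. The symbols $A$, $B$, $Q$, $R$, $w_t$, and $J^\star$ below are the standard textbook ones and are an assumption here, since this page does not define them; the paper's own definitions may differ in detail.

$$x_{t+1} = A x_t + B u_t + w_t, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$

$$\mathrm{Regret}_T = \sum_{t=1}^{T} c_t - T \, J^\star,$$

where $x_t$ is the state, $u_t$ the control input, $w_t$ the process noise, and $J^\star$ the steady-state average cost of the best controller in the comparison class. A model-free learner chooses $u_t$ without ever estimating $A$ or $B$, and a $\sqrt{T}$ regret bound says that $\mathrm{Regret}_T$ grows no faster than on the order of $\sqrt{T}$.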

We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon T. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.
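To give a feel for what a policy gradient scheme on an LQR instance can look like, here is a minimal sketch of a generic two-point zeroth-order gradient method acting on a linear policy $u_t = -K x_t$. It is not the paper's algorithm: the system matrices, the noise-free finite-horizon rollout, the step size, and the function names (`rollout_cost`, `two_point_gradient`, `run_policy_gradient`) are all illustrative assumptions. The only thing it shares with the abstract is the idea of improving the policy from observed costs alone, with no estimate of the dynamics.

```python
import numpy as np


def rollout_cost(K, A, B, Q, R, x0, horizon):
    """Noise-free finite-horizon cost of the linear policy u_t = -K x_t.

    A and B are used only to simulate the system; the learner below
    never reads them and only sees the returned cost.
    """
    x, total = x0.copy(), 0.0
    for _ in range(horizon):
        u = -K @ x
        total += float(x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return total


def two_point_gradient(cost_fn, K, radius, rng):
    """Zeroth-order gradient estimate from two cost evaluations."""
    U = rng.standard_normal(K.shape)
    U /= np.linalg.norm(U)  # random direction on the unit (Frobenius) sphere
    diff = cost_fn(K + radius * U) - cost_fn(K - radius * U)
    return (K.size / (2.0 * radius)) * diff * U


def run_policy_gradient(steps=200, step_size=0.005, radius=0.01, seed=0):
    """Gradient descent in policy space on a small, hypothetical system."""
    rng = np.random.default_rng(seed)
    # Illustrative system (an assumption): A is stable, so K = 0 is a valid start.
    A = np.array([[0.5, 0.1], [0.0, 0.5]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    x0 = np.array([1.0, 1.0])

    def cost_fn(K):
        return rollout_cost(K, A, B, Q, R, x0, horizon=30)

    K = np.zeros((1, 2))
    initial_cost = cost_fn(K)
    for _ in range(steps):
        K = K - step_size * two_point_gradient(cost_fn, K, radius, rng)
    return initial_cost, cost_fn(K), K


if __name__ == "__main__":
    before, after, K = run_policy_gradient()
    print(f"cost before: {before:.4f}, cost after: {after:.4f}")
    print("learned gain K:", K)
```

Each perturbed evaluation `cost_fn(K + radius * U)` is a step of exploration in policy space, and playing a perturbed policy costs more than playing the current best one. The abstract's "cost of exploration in policy space" refers to this kind of overhead, which the paper's analysis bounds more tightly; the sketch makes no attempt to control it.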

