OCLGMLJun 27, 2020

Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon

arXiv:2006.15316v451 citations
Originality Incremental advance
AI Analysis

This provides a theoretical guarantee for reinforcement learning in continuous-time control systems with unknown parameters, though it is incremental as it builds on existing LQ frameworks.

The paper tackles the problem of episodic continuous-time linear-quadratic reinforcement learning with unknown coefficients over a finite-time horizon, achieving a logarithmic regret bound of order O((ln M)(ln ln M)) for a least-squares algorithm based on continuous-time observations.

We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two parts: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes