LG · AI · SY · OC · Jul 24, 2024

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

arXiv:2407.17226v67 citations · h-index: 7
Originality: Incremental advance
AI Analysis

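This work addresses control of stochastic systems whose volatilities depend on both the state and the control. It offers an incremental advance: a model-free algorithm with a provable sublinear regret bound that, in the authors' numerical comparisons, outperforms adapted model-based methods.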

The paper tackles reinforcement learning for continuous-time linear-quadratic control problems with state- and control-dependent volatilities, achieving a sublinear regret bound of O(N^{3/4}) up to a logarithmic factor through a model-free algorithm.
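To see why a bound of this order counts as sublinear, it can help to divide by the number of episodes. The short derivation below is only an illustration: the constant $c$ and the exponent $p$ of the logarithmic factor are placeholders assumed here, not values taken from the paper.

$$
\mathrm{Regret}(N) \le c\, N^{3/4} (\log N)^{p}
\quad\Longrightarrow\quad
\frac{\mathrm{Regret}(N)}{N} \le c\, N^{-1/4} (\log N)^{p} \xrightarrow[N \to \infty]{} 0 .
$$

In words, the average regret per episode vanishes as $N$ grows, because any fixed power of $\log N$ grows more slowly than $N^{1/4}$.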

We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an RL algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor, where $N$ is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
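The setting described in the abstract can be sketched with a simple Euler-Maruyama simulation of one episode. This is a minimal illustration, not the authors' algorithm: the dynamics form $dx = (Ax + Bu)\,dt + (Cx + Du)\,dW$, the linear feedback policy `phi * x`, the per-step Gaussian exploration noise `explore_std`, the quadratic state-only reward with weights `Q` and `H`, and all default parameter values are assumptions chosen for the example.

```python
import numpy as np


def simulate_episode(phi, A=-0.5, B=1.0, C=0.2, D=0.3, x0=1.0,
                     T=1.0, n_steps=1000, explore_std=0.0, seed=0):
    """Simulate one scalar state path under a linear feedback policy.

    Assumed dynamics (hypothetical parameter names):
        dx = (A*x + B*u) dt + (C*x + D*u) dW,
    so the volatility depends on both the state x and the control u.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        # Linear feedback plus optional Gaussian exploration noise (assumed form).
        u = phi * x[k] + explore_std * rng.standard_normal()
        # Brownian increment over one time step.
        dW = np.sqrt(dt) * rng.standard_normal()
        # Euler-Maruyama update with state- and control-dependent volatility.
        x[k + 1] = x[k] + (A * x[k] + B * u) * dt + (C * x[k] + D * u) * dW
    return x


def state_only_reward(x, T=1.0, Q=1.0, H=1.0):
    """Quadratic reward on the state only (no running control reward).

    The weights Q and H are illustrative assumptions.
    """
    dt = T / (len(x) - 1)
    running = 0.5 * Q * np.sum(x[:-1] ** 2) * dt  # left Riemann sum
    terminal = 0.5 * H * x[-1] ** 2
    return -(running + terminal)


if __name__ == "__main__":
    path = simulate_episode(phi=-0.8, explore_std=0.1, seed=42)
    print("terminal state:", path[-1])
    print("episode reward:", state_only_reward(path))
```

A learning loop would call `simulate_episode` once per episode, update `phi` from the observed path, and shrink `explore_std` over episodes according to some schedule; the specific update rule and schedule are the paper's contribution and are not reproduced here.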

