LG AI SY OC

Jul 24, 2024

Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems

arXiv:2407.17226v612.57 citations

h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses reinforcement learning for stochastic control systems whose volatilities depend on both the state and the control, offering an incremental improvement in regret performance over model-based methods.

The paper tackles reinforcement learning for continuous-time linear-quadratic control problems with state- and control-dependent volatilities, achieving a sublinear regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor through a model-free algorithm.
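
To see why a bound of this order counts as sublinear, write $R(N)$ for the cumulative regret over $N$ episodes and suppose, as one illustrative reading of "up to a logarithmic factor", that $R(N) \le C N^{3/4} (\log N)^{p}$. The symbols $R$, $C > 0$ and $p \ge 0$ are placeholders introduced here, not values taken from the paper. Dividing by $N$ gives

$$
\frac{R(N)}{N} \;\le\; C \, N^{-1/4} (\log N)^{p} \;\longrightarrow\; 0 \quad \text{as } N \to \infty,
$$

so under this assumption the average regret per episode vanishes as the number of episodes grows, although more slowly than it would under an $O(\sqrt{N})$ bound.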

We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an RL algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor, where $N$ is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
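
The setting described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's algorithm: it assumes a generic scalar dynamic $dx = (a x + b u)\,dt + (c x + d u)\,dW$, a linear feedback policy $u = \phi x$ with additive Gaussian exploration noise, and an Euler-Maruyama discretization. The function name `simulate_episode`, the coefficients `a`, `b`, `c`, `d`, and all default values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_episode(phi, explore_std, a=-0.5, b=1.0, c=0.2, d=0.3,
                     x0=1.0, horizon=1.0, steps=200, seed=0):
    """Euler-Maruyama path of dx = (a*x + b*u) dt + (c*x + d*u) dW.

    The control is u = phi*x plus Gaussian exploration noise. All
    coefficients are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dt = horizon / steps
    x = x0
    path = [x]
    for _ in range(steps):
        # Linear feedback with additive exploration noise (assumed policy form).
        u = phi * x + explore_std * rng.gauss(0.0, 1.0)
        # Brownian increment over one time step.
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Both drift and volatility depend on the state x and the control u.
        x = x + (a * x + b * u) * dt + (c * x + d * u) * dw
        path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_episode(phi=-0.5, explore_std=0.1)
    print(f"terminal state after {len(path) - 1} steps: {path[-1]:.4f}")
```

Because the control enters the volatility term, exploration noise injected through `u` also changes the randomness of the state path itself, which is one reason this setting differs from LQ problems with constant volatility.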
