LGDSNov 16, 2024

Efficient, Low-Regret, Online Reinforcement Learning for Linear MDPs

arXiv:2411.10906v1h-index: 17
Originality Incremental advance
AI Analysis

This work addresses space efficiency for practitioners in reinforcement learning, though it is incremental as it builds directly on an existing algorithm.

The authors tackled the high space usage of the LSVI-UCB algorithm in linear MDPs by proposing two modified versions that alternate learning periods, achieving low space and time usage while maintaining sublinear regret, as shown in experiments on synthetic and real-world data.

Reinforcement learning algorithms are usually stated without theoretical guarantees regarding their performance. Recently, Jin, Yang, Wang, and Jordan (COLT 2020) showed a polynomial-time reinforcement learning algorithm (namely, LSVI-UCB) for the setting of linear Markov decision processes, and provided theoretical guarantees regarding its running time and regret. In real-world scenarios, however, the space usage of this algorithm can be prohibitive due to a utilized linear regression step. We propose and analyze two modifications of LSVI-UCB, which alternate periods of learning and not-learning, to reduce space and time usage while maintaining sublinear regret. We show experimentally, on synthetic data and real-world benchmarks, that our algorithms achieve low space usage and running time, while not significantly sacrificing regret.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes