LG ML | Jul 3, 2024

Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

arXiv:2407.03065v1 | 12.56 citations | h-index: 9

Originality: Incremental advance

AI Analysis

This work removes a practical obstacle to implementing policy optimization in linear MDPs, the pure exploration warm-up phase, which matters to both researchers and practitioners. The advance is incremental: it builds on the algorithm of Sherman et al. [2023a].

The paper eliminates the costly warm-up phase in policy optimization for linear Markov Decision Processes, achieving rate-optimal regret with improved dependence on the horizon and feature dimension under adversarial losses with full-information feedback and stochastic losses with bandit feedback.

Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
