OC, LG, ML · Oct 4, 2023

Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

arXiv:2310.02671v2 · 12 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficiently optimizing policies in finite-time sequential decision-making problems, such as those arising in the training of large language models. The advance is incremental, as it builds on existing policy gradient methods.

The paper tackles the problem of learning non-stationary policies in finite-horizon Markov Decision Processes by introducing dynamic policy gradient, which trains parameters backward in time, and shows it yields improved convergence bounds compared to simultaneous training.

Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. In finite-time horizons, such problems are relevant, for instance, for optimal stopping or specific supply chain problems, but also in the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary, so policies must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time. For the tabular softmax parametrisation, we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings without regularisation. It turns out that dynamic policy gradient training exploits the structure of finite-time problems much better, which is reflected in improved convergence bounds.
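The backward-in-time training described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm as published: it assumes exact gradients, a uniform state weighting `mu`, a fixed number of gradient steps per epoch, and time-homogeneous transitions and rewards. The names `random_mdp`, `dynamic_policy_gradient`, and `optimal_values` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def softmax(theta):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def random_mdp(S, A, seed=0):
    # Hypothetical helper: random transition kernel P[s, a, s'] and rewards R[s, a].
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    return P, R

def dynamic_policy_gradient(P, R, H, steps=2000, lr=1.0):
    # Train one tabular softmax policy per epoch, from the last epoch to the first.
    S, A = R.shape
    mu = np.full(S, 1.0 / S)      # assumed uniform state weighting
    V_next = np.zeros(S)          # value after the final epoch is zero
    thetas = [None] * H
    for h in reversed(range(H)):
        # Q-values for epoch h, given the already-trained later policies.
        Q = R + P @ V_next
        theta = np.zeros((S, A))
        for _ in range(steps):
            pi = softmax(theta)
            V = (pi * Q).sum(axis=1)
            # Exact softmax policy gradient: mu(s) * pi(a|s) * (Q(s,a) - V(s)).
            theta += lr * mu[:, None] * pi * (Q - V[:, None])
        pi = softmax(theta)
        V_next = (pi * Q).sum(axis=1)   # value of the trained policies from epoch h on
        thetas[h] = theta
    return thetas, V_next

def optimal_values(P, R, H):
    # Reference solution by backward induction (dynamic programming).
    V = np.zeros(R.shape[0])
    for _ in range(H):
        V = (R + P @ V).max(axis=1)
    return V
```

Calling `dynamic_policy_gradient(*random_mdp(3, 2), H=3)` returns one parameter table per epoch together with the value of the learned non-stationary policy at the first epoch, which can be compared against `optimal_values` on the same MDP. Simultaneous training would instead update all `H` parameter tables in every gradient step.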

