LG · May 12

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

Michael Lu, Max Qiushi Lin, Mo Chen, Sharan Vaswani

arXiv:2605.11694 · 70.4

Predicted impact: top 23% in LG · last 90 days

Originality: Incremental advance

AI Analysis

For reinforcement learning practitioners deploying single policies in constrained environments, this work provides theoretical guarantees for a practical algorithm, though it is incremental over existing augmented Lagrangian methods.

The paper addresses the lack of last-iterate convergence guarantees for constrained MDPs in practical settings, proposing an augmented Lagrangian framework that achieves global last-iterate convergence for tabular and log-linear policies, and scales to non-linear policies with empirical validation on continuous control tasks.

We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This creates a mismatch with practice, where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ with the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with guarantees comparable to prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
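To make the inexact AL idea concrete, here is a minimal sketch on a toy problem; it is not the paper's algorithm or code. It assumes a one-state, two-action CMDP (effectively a constrained bandit), the textbook augmented Lagrangian for a single inequality constraint, and plain projected gradient ascent on the simplex as a stand-in for the PQA inner solver. The names `project_simplex`, `inexact_al`, `penalty`, and `step`, along with all numeric values, are illustrative assumptions.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)


def inexact_al(r, c, b, penalty=1.0, step=0.5, outer_iters=20, inner_iters=200):
    """Toy inexact augmented Lagrangian loop (illustrative assumption).

    Problem: maximize pi @ r over the simplex subject to pi @ c >= b.
    Augmented Lagrangian (textbook form for one inequality constraint):
        pi @ r - (max(0, lam - penalty * (pi @ c - b))**2 - lam**2) / (2 * penalty)
    """
    pi = np.full(len(r), 1.0 / len(r))  # start from the uniform policy
    lam = 0.0                           # Lagrange multiplier estimate
    for _ in range(outer_iters):
        # Inner loop: approximately maximize the AL over pi (warm-started).
        for _ in range(inner_iters):
            mult = max(0.0, lam - penalty * (pi @ c - b))
            grad = r + mult * c         # gradient of the AL with respect to pi
            pi = project_simplex(pi + step * grad)
        # Outer step: multiplier update using the current constraint slack.
        lam = max(0.0, lam - penalty * (pi @ c - b))
    return pi, lam


# Action 0 earns reward only; action 1 earns constraint utility only.
r = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
pi, lam = inexact_al(r, c, b=0.5)
print("last-iterate policy:", pi, "multiplier:", lam)
```

In this toy instance the constrained optimum splits probability evenly between the two actions, and the returned policy is the last iterate itself rather than an average over earlier iterates, which is the property the abstract emphasizes.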
