LG OC May 12

Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

arXiv:2605.11586 55.0

AI Analysis

Provides theoretical foundations and improved regret bounds for reinforcement learning under constraints in average-reward settings, benefiting researchers in constrained RL.

The paper establishes strong duality for weakly communicating average-reward CMDPs and proposes a primal-dual algorithm achieving regret and constraint violation bounds of Õ(T^{2/3}), improving over prior work.

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Although the weakly communicating setting admits no linear programming formulation and the problem is therefore nonconvex, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, where $T$ denotes the number of interactions, improving upon the best known bounds. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving the improved regret bounds. To analyze the algorithm, we develop a novel approach based on strong duality that decomposes the composite Lagrangian regret into separate bounds on regret and constraint violation.
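
For orientation, the strong duality claim can be sketched in generic notation. The symbols here are assumptions made for illustration and are not taken from the paper: $J_r(\pi)$ and $J_c(\pi)$ stand for the long-run average reward and average utility of a stationary policy $\pi$, $b$ is a constraint threshold with the constraint written as $J_c(\pi) \ge b$, and $\lambda \ge 0$ is the dual variable.

$$
L(\pi, \lambda) = J_r(\pi) + \lambda \bigl( J_c(\pi) - b \bigr),
\qquad
\sup_{\pi} \inf_{\lambda \ge 0} L(\pi, \lambda) = \inf_{\lambda \ge 0} \sup_{\pi} L(\pi, \lambda).
$$

In this sketch the inner infimum on the left equals $J_r(\pi)$ when $\pi$ is feasible and $-\infty$ otherwise, so the left side is the constrained optimum. The inequality $\le$ holds in general (weak duality); the abstract's claim is that equality also holds in the weakly communicating setting, where the usual convexity argument is unavailable.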
