LG · AI · Feb 27

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Xiang Li, Nan Jiang, Yuheng Zhang
arXiv:2602.23811v1
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck in offline RL for researchers and practitioners by enabling the use of standalone parameterized policies, which are common in real-world applications.

The paper addresses the restriction of computationally tractable offline reinforcement learning algorithms to finite action spaces by extending theoretical guarantees to parameterized policies over large or continuous action spaces. It shows how connecting mirror descent to natural policy gradient yields novel analyses and algorithmic insights.

We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
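For intuition about what "state-wise mirror descent" with actors "implicitly induced from the critic functions" means in the finite-action setting, here is a minimal NumPy sketch. It assumes the standard exponentiated-gradient form of the update, in which each state's action distribution is reweighted by exp(eta * Q) and renormalized. It is not taken from the paper: the function names are hypothetical, and random arrays stand in for the critics that an offline algorithm would learn from data.

```python
import numpy as np

def mirror_descent_step(pi, q, eta):
    """One state-wise mirror descent (exponentiated-gradient) update.

    pi:  (num_states, num_actions) current policy, rows sum to 1.
    q:   (num_states, num_actions) critic estimate (assumed given).
    eta: step size (hypothetical value chosen by the caller).
    """
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def softmax_of_sum(qs, eta):
    """Closed form from a uniform start: softmax(eta * sum_t q_t) per state."""
    logits = eta * np.sum(qs, axis=0)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_states, num_actions, eta = 3, 4, 0.5
# Stand-in critics for five iterations; NOT learned from offline data.
qs = rng.normal(size=(5, num_states, num_actions))

# Start from the uniform policy and apply the update once per critic.
pi = np.full((num_states, num_actions), 1.0 / num_actions)
for q in qs:
    pi = mirror_descent_step(pi, q, eta)
```

In this sketch the final policy is fully determined by the running sum of critics, so no separate actor is stored, and normalizing requires enumerating every action in every state. That enumeration is feasible only for small finite action sets, which is the restriction the abstract describes for methods of this kind.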
