LG · AI · Jun 2, 2021

On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

arXiv:2106.00993v2 · 6 citations
AI Analysis

This work, aimed at researchers, addresses convergence in off-policy reinforcement learning and provides theoretical guarantees; the contribution is incremental in that it builds on existing methods.

The paper analyzes the convergence of off-policy policy optimization methods with density-ratio correction under function approximation and proves finite-time convergence guarantees for two strategies: P-SREDA attains a rate of O(ε^{-3}), whose dependence on ε is optimal, and O-SPIM reaches a stationary point with total complexity O(ε^{-4}), matching the rate of some recent actor-critic algorithms in the on-policy setting.

In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In our first strategy, we present the algorithm P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependence on $\epsilon$ is optimal. In our second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
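To make "state-action density ratio correction" concrete, the following is a minimal tabular sketch, not the paper's P-SREDA or O-SPIM. It assumes a hypothetical 2-state, 2-action MDP whose transition probabilities, rewards, and policies are made up for illustration, and it computes the exact density ratio instead of learning it with function approximation. It checks the standard identity that reweighting behavior-policy occupancy by the ratio recovers the (normalized) return of the target policy.

```python
# Hypothetical 2-state, 2-action MDP; every number below is an illustrative assumption.
GAMMA = 0.9
N_S, N_A = 2, 2
P = [  # P[s][a][s2]: probability of moving to s2 after taking a in s
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
R = [[1.0, 0.0], [0.0, 2.0]]   # R[s][a]: reward
NU = [0.6, 0.4]                # initial state distribution
PI = [[0.9, 0.1], [0.2, 0.8]]  # target policy pi(a|s)
MU = [[0.5, 0.5], [0.5, 0.5]]  # behavior policy mu(a|s)


def occupancy(policy, iters=2000):
    """Normalized discounted state-action occupancy d(s, a), by fixed-point iteration."""
    d = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        # Probability mass flowing into each next state under the current estimate.
        inflow = [
            sum(d[s][a] * P[s][a][s2] for s in range(N_S) for a in range(N_A))
            for s2 in range(N_S)
        ]
        d = [
            [((1 - GAMMA) * NU[s2] + GAMMA * inflow[s2]) * policy[s2][a2]
             for a2 in range(N_A)]
            for s2 in range(N_S)
        ]
    return d


def policy_value(policy, iters=2000):
    """Expected discounted return from NU, by iterative policy evaluation."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [
            sum(
                policy[s][a]
                * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S)))
                for a in range(N_A)
            )
            for s in range(N_S)
        ]
    return sum(NU[s] * v[s] for s in range(N_S))


d_pi = occupancy(PI)
d_mu = occupancy(MU)

# Exact density ratio w(s, a) = d_pi(s, a) / d_mu(s, a); d_mu > 0 everywhere here.
w = [[d_pi[s][a] / d_mu[s][a] for a in range(N_A)] for s in range(N_S)]

# Off-policy estimate: expectation under d_mu of w * R.
corrected = sum(d_mu[s][a] * w[s][a] * R[s][a] for s in range(N_S) for a in range(N_A))
# On-policy reference: (1 - gamma) times the discounted return of the target policy.
direct = (1 - GAMMA) * policy_value(PI)

print(corrected, direct)
```

In the setting the abstract describes, the ratio is not available in closed form and must itself be estimated, which is where the max-max-min formulation and the bias analysis come in; the sketch only shows what the correction is meant to achieve once the ratio is known.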

