LG · AI · GT · Mar 17

Bi-Level Policy Optimization with Nyström Hypergradients

arXiv:2505.11714 · 10.11 citations · h-index: 7
Predicted impact: top 77% in LG · last 90 days · Originality: Incremental advance
AI Analysis

The paper addresses numerical instability in hypergradient computation for actor-critic reinforcement learning, offering a more robust method for researchers and practitioners. The advance is incremental, since it builds on existing actor-critic frameworks.

The paper tackles the bilevel optimization structure in actor-critic reinforcement learning by proposing BLPO, which uses nesting and Nyström hypergradients to improve stability and convergence, achieving performance comparable to or better than PPO on control tasks.

The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies the necessary conditions for) a local strong Stackelberg equilibrium in polynomial time with high probability, assuming a linear parametrization of the critic's objective. Empirically, we demonstrate that BLPO performs on par with or better than PPO on a variety of discrete and continuous control tasks.
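To make the inverse-Hessian-vector product step concrete, here is a minimal NumPy sketch of a Nyström-style approximation. It is an illustration under stated assumptions, not the paper's implementation. The function name `nystrom_ihvp`, the damping term `rho`, the uniform column sampling, and the explicit toy matrix `H` are all assumptions introduced here. In practice the Hessian of the critic's objective would not be formed explicitly; the sampled columns would typically come from Hessian-vector products. The sketch replaces H with the low-rank form C W^{-1} C^T built from k sampled columns, then applies the Woodbury identity so that only a k x k linear system is solved.

```python
import numpy as np


def nystrom_ihvp(H, v, k, rho, rng):
    """Approximate (H + rho * I)^{-1} v with a rank-k Nystrom approximation of H.

    H   : (n, n) symmetric positive semidefinite matrix (explicit here for clarity)
    v   : (n,) vector
    k   : number of sampled columns (1 <= k <= n)
    rho : positive damping term (an assumption of this sketch)
    rng : numpy.random.Generator used to sample column indices
    """
    n = H.shape[0]
    idx = rng.choice(n, size=k, replace=False)  # sampled column indices
    C = H[:, idx]   # (n, k) sampled columns of H
    W = C[idx, :]   # (k, k) block of H at the sampled rows and columns
    # Nystrom: H ~= C W^{-1} C^T. The Woodbury identity then gives
    # (rho*I + C W^{-1} C^T)^{-1} v = v/rho - C (W + C^T C / rho)^{-1} C^T v / rho^2,
    # which only needs a k x k linear solve.
    inner = W + (C.T @ C) / rho
    return v / rho - C @ np.linalg.solve(inner, C.T @ v) / rho**2


# Toy check on a small, well-conditioned symmetric positive definite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + np.eye(6)
v = rng.standard_normal(6)
rho = 0.1

exact = np.linalg.solve(H + rho * np.eye(6), v)
approx_full = nystrom_ihvp(H, v, k=6, rho=rho, rng=rng)  # k = n: Nystrom form equals H
approx_low = nystrom_ihvp(H, v, k=3, rho=rho, rng=rng)   # k < n: a low-rank approximation
```

When all columns are sampled (k = n) and H is invertible, the Nyström form reproduces H, so `approx_full` should match the direct solve up to floating-point error. With k < n the result is only an approximation, and its quality depends on how quickly the spectrum of H decays.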
