LG, OC | Jan 23

A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning

arXiv:2601.16399v2 | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses inefficiencies in bi-level RL for applications such as RLHF. The advance is incremental: it builds on existing regularization and convergence-analysis methods.

The paper tackles bi-level optimization in reinforcement learning, where an upper-level objective depends on the policy that solves a lower-level policy optimization problem. It proposes a single-loop actor-critic algorithm with entropy regularization, establishes convergence to a stationary point, and validates the method on a GridWorld task and an RLHF task.

We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).
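The role of the attenuating entropy regularization can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a single-state MDP (a bandit) with a hypothetical reward vector, where the entropy-regularized optimal policy is known in closed form as a softmax of the rewards divided by the regularization weight `tau`. The names `soft_optimal_policy`, `expected_reward`, and the `tau` schedule are illustrative assumptions. The sketch shows only that the reward lost to regularization shrinks as `tau` decreases, which is the intuition behind estimating the upper-level gradient without bias in the limit.

```python
import math

def soft_optimal_policy(rewards, tau):
    """Maximizer of sum_a pi(a) * r(a) + tau * H(pi) over the simplex.

    For a single-state problem this is softmax(rewards / tau).
    The max is subtracted before exponentiating for numerical stability.
    """
    m = max(rewards)
    weights = [math.exp((r - m) / tau) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(policy, rewards):
    """Unregularized expected reward of a policy."""
    return sum(p * r for p, r in zip(policy, rewards))

# Hypothetical rewards for three actions (an assumption for illustration).
rewards = [1.0, 0.5, 0.0]
best = max(rewards)

# Gap between the unregularized optimum and the regularized solution,
# evaluated for a decreasing (attenuating) regularization weight.
gaps = []
for tau in [1.0, 0.1, 0.01]:
    pi = soft_optimal_policy(rewards, tau)
    gaps.append(best - expected_reward(pi, rewards))

print(gaps)
```

In the full bi-level setting the lower-level problem is a multi-state MDP and the policy is learned by an actor-critic rather than computed in closed form, so this sketch only conveys the qualitative effect of shrinking the regularization weight.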
