LG | AI | May 7

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

arXiv:2605.05863 50.41 citations
AI Analysis

For practitioners combining offline and online RL, SOPE eliminates manual tuning of stabilization phases while improving both performance and computational efficiency.

SOPE uses an actor-aligned Off-Policy Policy Evaluation signal to dynamically control offline training length in online RL with prior data, improving baseline performance by up to 45.6% and reducing TFLOPs by up to 22x on 25 continuous control tasks.

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves on baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the trade-off between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than static, exhaustive update schedules.
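The early-stopping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact form of the OPE signal or the stopping rule, so the sketch assumes the signal is the mean critic value on held-out states with actions chosen by the current policy, and that stopping uses a simple patience rule. The names `actor_aligned_ope`, `run_offline_phase`, `patience`, and `min_delta`, as well as the toy critic and policy, are hypothetical.

```python
from statistics import fmean
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def actor_aligned_ope(
    critic: Callable[[State, Action], float],
    policy: Callable[[State], Action],
    val_states: Sequence[State],
) -> float:
    # ASSUMPTION: the signal is the mean critic value on held-out states,
    # with actions taken from the current policy rather than from the dataset.
    return fmean(critic(s, policy(s)) for s in val_states)


def run_offline_phase(
    update_step: Callable[[], None],
    evaluate: Callable[[], float],
    max_updates: int,
    patience: int = 2,
    min_delta: float = 0.05,
) -> int:
    # ASSUMPTION: patience-based rule. Stop once the signal has failed to
    # improve by more than min_delta for `patience` consecutive evaluations.
    best = float("-inf")
    stale = 0
    for step in range(1, max_updates + 1):
        update_step()
        score = evaluate()
        if score > best + min_delta:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return step  # number of offline updates actually performed
    return max_updates


# Toy stand-ins for a learner: one scalar parameter that converges to 1.0.
params = {"w": 0.0}
val_states = [(0.0,), (1.0,)]  # hypothetical held-out validation split


def toy_policy(state: State) -> Action:
    return (0.5,)


def toy_critic(state: State, action: Action) -> float:
    return params["w"] * (state[0] + action[0])


def toy_update() -> None:
    params["w"] += 0.5 * (1.0 - params["w"])


steps_used = run_offline_phase(
    update_step=toy_update,
    evaluate=lambda: actor_aligned_ope(toy_critic, toy_policy, val_states),
    max_updates=100,
)
print(f"stopped after {steps_used} of 100 offline updates")
```

In this toy setting the signal saturates quickly, so the loop stops well before the update budget is spent. That is the kind of compute saving an adaptive schedule targets, in contrast to a fixed-length phase that must be tuned per task.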

