LG · Jun 3

Reusing Trajectories in Policy Gradients Enables Fast Convergence

Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli

arXiv:2506.06178

AI Analysis

For researchers in reinforcement learning, this work provides the first rigorous theoretical evidence that reusing past trajectories can accelerate policy gradient convergence, targeting the sample inefficiency of methods that rely on fresh on-policy data.

The paper proposes RT-PG, a policy gradient algorithm that reuses off-policy trajectories from the most recent $ω$ iterations to improve sample efficiency. It achieves a sample complexity of $\tilde{O}(ε^{-2}ω^{-1})$ and, when all past trajectories are reused, $\tilde{O}(ε^{-1})$, the best known rate for PG methods.

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $O(ε^{-2})$ trajectories to reach an $ε$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $O(ε^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $ω$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\tilde{O}(ε^{-2}ω^{-1})$. When reusing all available past trajectories, this leads to a rate of $\tilde{O}(ε^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.
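To make the idea of combining on-policy and off-policy data concrete, here is a minimal sketch of a multiple importance weighting gradient estimator over a window of recent batches. It uses the standard balance heuristic, not the power mean-corrected estimator of RT-PG, whose exact form is not given in the abstract. The one-step Gaussian policy, the quadratic return, and all names (`gaussian_pdf`, `bh_gradient`, `behavior_means`) are assumptions made for illustration only.

```python
import numpy as np


def gaussian_pdf(a, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at a (broadcasts over arrays)."""
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def bh_gradient(theta, behavior_means, actions, returns, sigma):
    """Balance-heuristic estimate of dJ/dtheta for a 1-D Gaussian policy.

    actions, returns: arrays of shape (k, N); row i holds N samples drawn
    from the behaviour policy N(behavior_means[i], sigma^2).
    Assumes equal batch sizes, so the mixture weights are all 1/k.
    """
    means = np.asarray(behavior_means, dtype=float)
    a = np.asarray(actions, dtype=float)
    r = np.asarray(returns, dtype=float)

    # Density of every sample under the target policy.
    target = gaussian_pdf(a, theta, sigma)
    # Density of every sample under the uniform mixture of the k behaviour
    # policies: (k, 1, 1) means against (1, k, N) samples, averaged over policies.
    mixture = gaussian_pdf(a[None, :, :], means[:, None, None], sigma).mean(axis=0)

    weights = target / mixture            # importance weights, shape (k, N)
    score = (a - theta) / sigma ** 2      # d/dtheta log p_theta(a)
    return float(np.mean(weights * score * r))


# Hypothetical usage: reuse batches from the last k = 3 iterates of the mean.
rng = np.random.default_rng(0)
sigma = 1.0
behavior_means = np.array([0.0, 0.3, 0.5])   # the last entry is the current policy
actions = rng.normal(behavior_means[:, None], sigma, size=(3, 50_000))
returns = -(actions - 2.0) ** 2               # toy return, peak at a = 2
estimate = bh_gradient(0.5, behavior_means, actions, returns, sigma)
# Analytic gradient for this toy problem: -2 * (theta - 2) = 3.0 at theta = 0.5.
print(estimate)
```

Because the target policy is one of the mixture components, each weight is bounded by the window size k, which is one reason the balance heuristic is a common starting point for reusing several batches at once.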
