Reusing Trajectories in Policy Gradients Enables Fast Convergence
For researchers in reinforcement learning, this work provides the first theoretical proof that reusing past trajectories can significantly accelerate policy gradient convergence, addressing a key bottleneck in sample efficiency.
The paper proposes RT-PG, a policy gradient algorithm that reuses past off-policy trajectories to improve sample efficiency, achieving a sample complexity of O(ε^{-2}ω^{-1}) and, with full reuse, O(ε^{-1}), the best known rate for PG methods.
Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $O(ε^{-2})$ trajectories to reach an $ε$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $O(ε^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $ω$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\tilde{O}(ε^{-2}ω^{-1})$. When reusing all available past trajectories, this leads to a rate of $\tilde{O}(ε^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.