LG | OC | Mar 1, 2024

Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

Georgia Tech
arXiv:2403.00675v2 | 4 citations | h-index: 5 | Oper Res
AI Analysis

This work addresses the data-efficiency challenge in reinforcement learning for control applications. It is incremental: it builds on existing policy gradient methods and supplies a theoretical justification for them.

The paper tackles the problem of inefficient data usage in reinforcement learning by reusing historical trajectories via importance sampling in natural policy gradient methods, showing that the approach is convergent and improves convergence rates.

Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well. However, the existing literature often neglects the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method that reuses historical trajectories via importance sampling. We show that the bias of the proposed gradient estimator is asymptotically negligible, that the resulting algorithm is convergent, and that reusing past trajectories helps improve the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.
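
The idea of reusing trajectories from earlier policies can be illustrated with a small sketch. This is not the paper's estimator or its experimental setup. It assumes a one-step problem with a Gaussian policy N(theta, SIGMA^2), a toy quadratic reward, and a fixed reuse window. All names (`SIGMA`, `reward`, `reuse_gradient`, `run`) are hypothetical. Each stored action is reweighted by the likelihood ratio between the current policy and the policy that generated it, and the resulting gradient estimate is preconditioned by the inverse Fisher information, which for this policy is simply SIGMA^2.

```python
import numpy as np

SIGMA = 1.0  # fixed policy standard deviation (assumption for this sketch)


def log_prob(action, theta):
    """Log-density of the Gaussian policy N(theta, SIGMA^2) at `action`."""
    return -0.5 * ((action - theta) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))


def score(action, theta):
    """Derivative of log_prob with respect to theta."""
    return (action - theta) / SIGMA ** 2


def reward(action):
    """Toy reward, maximised at action = 2 (hypothetical)."""
    return -(action - 2.0) ** 2


def reuse_gradient(theta, history):
    """Importance-weighted policy gradient estimate at `theta`.

    `history` is a list of (behaviour_theta, actions) pairs, one per past
    iteration. Each action is weighted by pi_theta(a) / pi_behaviour(a).
    """
    terms = []
    for behaviour_theta, actions in history:
        weights = np.exp(log_prob(actions, theta) - log_prob(actions, behaviour_theta))
        terms.append(weights * reward(actions) * score(actions, theta))
    return float(np.mean(np.concatenate(terms)))


def run(num_iters=300, batch=20, reuse=5, step=0.05, seed=0):
    """Natural-gradient ascent that reuses the last `reuse` batches."""
    rng = np.random.default_rng(seed)
    theta, history = 0.0, []
    for _ in range(num_iters):
        actions = rng.normal(theta, SIGMA, size=batch)
        history.append((theta, actions))
        history = history[-reuse:]           # keep only the most recent batches
        grad = reuse_gradient(theta, history)
        fisher = 1.0 / SIGMA ** 2            # Fisher information of the Gaussian mean
        theta += step * grad / fisher        # natural gradient step
    return theta


if __name__ == "__main__":
    print("final theta:", run())
```

In this sketch the reused batches were generated by earlier iterates that themselves depended on earlier samples, so the batches are not independent of one another. That interdependence is the issue the abstract says prior analyses neglect; the sketch makes no attempt to correct for it.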

