LG, Feb 23, 2021

Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model

arXiv:2102.11513v2, 13 citations
AI Analysis

This paper addresses the slow convergence of reinforcement learning in sequential decision-making applications, though it is an incremental improvement over existing methods.

The paper tackles the slow convergence of data-driven reinforcement learning by proposing the Mixed Policy Gradient (MPG) algorithm, which combines data-driven and model-driven policy gradients to accelerate convergence without performance loss. In simulations, it achieves the best asymptotic performance and convergence speed among the compared baselines.

Reinforcement learning (RL) shows great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven; they usually yield better asymptotic performance but converge much more slowly than model-driven methods. This paper proposes the mixed policy gradient (MPG) algorithm, which fuses the empirical data and the transition model in the policy gradient (PG) to accelerate convergence without performance degradation. Formally, MPG is constructed as a weighted average of the data-driven and model-driven PGs, where the former is the derivative of the learned Q-value function, and the latter is that of the model-predictive return. To guide the weight design, we analyze and compare the upper bound of each PG error. Relying on that, a rule-based method is employed to heuristically adjust the weights. In particular, to get a better PG, the weight of the data-driven PG is designed to grow along the learning process while that of the model-driven PG decreases. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed compared with other baseline algorithms.
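The weighted average described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `data_weight` and `mixed_policy_gradient` are hypothetical, the two gradients are passed in as plain lists of floats rather than computed from a Q-function or a model rollout, and the linear schedule is an assumption standing in for the paper's rule-based weight adjustment, which is derived from the PG error bounds and is not specified in this summary.

```python
from typing import List, Sequence


def data_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule for the data-driven PG weight.

    Assumption: the weight grows from 0 to 1 over training. The paper's
    actual rule is based on PG error bounds and is not reproduced here.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp to [0, 1] so steps outside the schedule stay valid weights.
    return min(max(step / total_steps, 0.0), 1.0)


def mixed_policy_gradient(
    data_pg: Sequence[float],
    model_pg: Sequence[float],
    w_data: float,
) -> List[float]:
    """Weighted average of a data-driven and a model-driven policy gradient.

    The model-driven weight is taken as 1 - w_data, so the two weights
    always sum to one.
    """
    if len(data_pg) != len(model_pg):
        raise ValueError("gradients must have the same length")
    if not 0.0 <= w_data <= 1.0:
        raise ValueError("w_data must lie in [0, 1]")
    w_model = 1.0 - w_data
    return [w_data * d + w_model * m for d, m in zip(data_pg, model_pg)]


if __name__ == "__main__":
    # Placeholder gradient vectors, for illustration only.
    g_data = [1.0, 2.0]
    g_model = [3.0, 4.0]
    for step in (0, 50, 100):
        w = data_weight(step, total_steps=100)
        print(step, w, mixed_policy_gradient(g_data, g_model, w))
```

Early in training the sketch relies almost entirely on the model-driven gradient, and it shifts toward the data-driven gradient as the step count grows, matching the direction of the weight change stated in the abstract.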

Code Implementations: 2 repos
