ML LG Feb 19, 2024

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Davide Mambelli, Stephan Bongers, Onno Zoeter, Matthijs T. J. Spaan, Frans A. Oliehoek

arXiv:2402.12034v15.51 citations h-index: 33

Originality Incremental advance

AI Analysis

This work addresses sample inefficiency for reinforcement learning practitioners, but it is incremental as it builds on existing off-policy objectives.

The paper tackles the problem of sample inefficiency in policy gradient methods by analyzing the difference between off-policy and on-policy objectives, providing theoretical conditions to reduce this gap and empirical evidence of issues when conditions are unmet.

Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify their objective function so that it can be computed from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis that identifies conditions under which the on-off gap is reduced, and we give empirical evidence of the shortfalls that arise when these conditions are not met.
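
For readers unfamiliar with the two objectives, they can be sketched as follows. This uses the standard definitions from the off-policy actor-critic literature rather than the paper's own notation. The symbols are assumptions introduced here for illustration: $\pi_\theta$ is the target policy, $b$ the behavior policy that generated the data, $d_0$ the start-state distribution, $d^{b}$ the state distribution induced by $b$, and $V^{\pi_\theta}$ the value function of the target policy. The sign convention for the gap is also an assumption.

```latex
% On-policy (start-state) objective: value of the target policy,
% weighted by the start-state distribution d_0.
J_{\mathrm{on}}(\theta) = \sum_{s} d_0(s)\, V^{\pi_\theta}(s)

% Excursion objective: the same value function, but weighted by the
% states visited under the behavior policy b.
J_{\mathrm{exc}}(\theta) = \sum_{s} d^{b}(s)\, V^{\pi_\theta}(s)

% On-off gap: the difference between the two objectives, which under
% these definitions depends only on the mismatch in state weighting.
\Delta(\theta) = J_{\mathrm{on}}(\theta) - J_{\mathrm{exc}}(\theta)
               = \sum_{s} \bigl( d_0(s) - d^{b}(s) \bigr)\, V^{\pi_\theta}(s)
```

Under these definitions the two objectives share the same value function and differ only in how states are weighted, so the gap vanishes when $d^{b}$ coincides with $d_0$. The precise conditions established in the paper may be stated differently.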
