AI LGFeb 12, 2024

SPO: Sequential Monte Carlo Policy Optimisation

Matthew V Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre

arXiv:2402.07963v311.66 citationsh-index: 5NIPS

Originality Incremental advance

AI Analysis

This work addresses scaling issues in planning-based reinforcement learning for intelligent agents, offering a method that is applicable to both discrete and continuous action spaces without modifications.

The paper tackles the scaling challenges of tree-based search methods in reinforcement learning by introducing SPO, a model-based algorithm that provides robust policy improvement and efficient scaling, demonstrating statistically significant performance improvements over baselines in both discrete and continuous environments.

Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.

View on arXiv PDF

Similar