Learning and Planning in Complex Action Spaces
The work addresses a bottleneck in applying reinforcement learning to complex real-world domains such as robotics, though the contribution is incremental in that it builds on existing methods such as MuZero.
The paper tackles reinforcement learning in high-dimensional or continuous action spaces, where enumerating every action is infeasible. It proposes a sample-based policy iteration framework and instantiates it as Sampled MuZero, which it demonstrates on Go and on continuous control benchmarks, achieving competitive performance.
Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
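To make the idea of evaluating and improving a policy over a sampled action subset concrete, here is a minimal Python sketch. It is not the paper's method: Sampled MuZero plans with tree search over the sampled actions, whereas this toy swaps in a plain softmax over Q-values as the improvement step. The function name `sampled_policy_improvement`, the Gaussian proposal, the quadratic Q-function, and the `temperature` parameter are all assumptions made for illustration.

```python
import math
import random


def sampled_policy_improvement(sample_action, q_value, num_samples=20, temperature=1.0):
    """Illustrative only: improve a policy over a sampled subset of actions.

    sample_action and q_value are hypothetical callables supplied by the caller.
    """
    # 1. Sample a small subset of actions instead of enumerating the full space.
    actions = [sample_action() for _ in range(num_samples)]
    # 2. Policy evaluation restricted to the sampled actions.
    q_values = [q_value(a) for a in actions]
    # 3. Policy improvement over the subset. A softmax over Q-values is an
    #    assumed stand-in here; the paper uses planning (MCTS) instead.
    q_max = max(q_values)  # subtract the max for numerical stability
    unnormalized = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    return actions, weights


# Toy usage: a 2-D continuous action space with a made-up Q-function
# that peaks at the (hypothetical) target action (0.5, -0.5).
rng = random.Random(0)
target = (0.5, -0.5)


def sample_action():
    # Assumed proposal policy: standard Gaussian in each action dimension.
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def q_value(action):
    # Assumed value estimate: negative squared distance to the target.
    return -sum((a - t) ** 2 for a, t in zip(action, target))


actions, weights = sampled_policy_improvement(sample_action, q_value)
best_action = actions[weights.index(max(weights))]
```

The returned `weights` form a distribution over the sampled actions only, which is the key point: every quantity is computed on the subset, so the cost depends on `num_samples` rather than on the size or dimensionality of the full action space.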