LG, AI, RO | Jul 24, 2023

Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation

arXiv:2307.12983v1 | Has Code
Originality: Incremental advance
AI Analysis

This work targets slow reinforcement learning training times by enabling efficient off-policy learning on a single workstation. It is an incremental improvement over existing distributed methods.

The paper tackles the challenge of scaling off-policy reinforcement learning, which is data-efficient but slow in wall-clock time. It introduces Parallel Q-Learning (PQL), which uses massively parallel GPU-based simulation to outperform PPO in wall-clock time while maintaining superior sample efficiency, and it scales to tens of thousands of parallel environments.

Reinforcement learning is time-consuming for complex tasks due to the need for large amounts of training data. Recent advances in GPU-based simulation, such as Isaac Gym, have sped up data collection thousands of times on a commodity GPU. Most prior works used on-policy methods like PPO due to their simplicity and ease of scaling. Off-policy methods are more data efficient but challenging to scale, resulting in a longer wall-clock training time. This paper presents a Parallel $Q$-Learning (PQL) scheme that outperforms PPO in wall-clock time while maintaining the superior sample efficiency of off-policy learning. PQL achieves this by parallelizing data collection, policy learning, and value learning. Unlike prior work on distributed off-policy learning, such as Apex, our scheme is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. In experiments, we demonstrate that $Q$-learning can be scaled to tens of thousands of parallel environments and investigate important factors affecting learning speed. The code is available at https://github.com/Improbable-AI/pql.
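
The abstract says PQL parallelizes data collection, policy learning, and value learning. The sketch below illustrates only that three-way split, and it is not the authors' implementation. It assumes a toy 5-state chain environment, a tabular Q-function, and a greedy policy table standing in for the neural networks, with Python threads standing in for concurrent workers. All names (`Shared`, `actor`, `v_learner`, `p_learner`, `run`) and all constants are hypothetical.

```python
import random
import threading
import time
from collections import deque

N_ENVS = 64      # hypothetical; the paper reports tens of thousands
N_STATES = 5     # toy chain: states 0..4, reward 1 on reaching state 4
GAMMA, LR, EPS = 0.9, 0.1, 0.3

class Shared:
    """State shared by the three workers, guarded by a single lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.buffer = deque(maxlen=10_000)               # (s, a, r, s2, done)
        self.q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[s][a]; 0=left, 1=right
        self.policy = [0] * N_STATES                     # deterministic policy table
        self.counts = {"actor": 0, "v_learner": 0, "p_learner": 0}

def step(s, a):
    """Toy chain dynamics (an assumption, not an environment from the paper)."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def actor(sh, iters, rng):
    """Data collection: step all environments as a batch with the latest policy."""
    states = [0] * N_ENVS
    for _ in range(iters):
        with sh.lock:
            policy = list(sh.policy)
        batch = []
        for i, s in enumerate(states):
            a = rng.randrange(2) if rng.random() < EPS else policy[s]
            s2, r, done = step(s, a)
            batch.append((s, a, r, s2, done))
            states[i] = 0 if done else s2                # reset finished episodes
        with sh.lock:
            sh.buffer.extend(batch)
            sh.counts["actor"] += 1

def v_learner(sh, iters, rng, batch_size=32):
    """Value learning: one-step Q-learning updates on replayed transitions."""
    updates = 0
    while updates < iters:
        with sh.lock:
            batch = rng.sample(list(sh.buffer), min(batch_size, len(sh.buffer)))
        if not batch:                                    # wait for the actor's first batch
            time.sleep(0.001)
            continue
        with sh.lock:
            for s, a, r, s2, done in batch:
                target = r + (0.0 if done else GAMMA * max(sh.q[s2]))
                sh.q[s][a] += LR * (target - sh.q[s][a])
            sh.counts["v_learner"] += 1
        updates += 1

def p_learner(sh, iters):
    """Policy learning: make the policy greedy w.r.t. the current Q-table."""
    for _ in range(iters):
        with sh.lock:
            sh.policy = [max((0, 1), key=lambda a: sh.q[s][a])
                         for s in range(N_STATES)]
            sh.counts["p_learner"] += 1

def run(actor_iters=200, learner_iters=500, seed=0):
    """Start the three workers concurrently and return the shared state."""
    sh = Shared()
    threads = [
        threading.Thread(target=actor, args=(sh, actor_iters, random.Random(seed))),
        threading.Thread(target=v_learner, args=(sh, learner_iters, random.Random(seed + 1))),
        threading.Thread(target=p_learner, args=(sh, learner_iters)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sh
```

Calling `run()` returns the shared state, so `sh.q` and `sh.policy` can be inspected afterwards. Because the workers interleave nondeterministically, the learned values vary from run to run. CPython threads also do not execute Python bytecode in parallel, so this shows the structure of the decoupling rather than any speedup.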

Code Implementations: 1 repo
