Massively Scaling Explicit Policy-conditioned Value Functions
This work addresses scaling bottlenecks in reinforcement learning for continuous-control domains. It is an incremental improvement that introduces novel architectural and exploration techniques.
The paper scales Explicit Policy-Conditioned Value Functions (EPVFs) to continuous-control tasks by addressing unrestricted parameter growth and inefficient exploration, achieving performance competitive with state-of-the-art methods such as PPO and SAC on complex tasks, including a custom Ant environment.
We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters θ, enabling direct gradient-based updates to the parameters of any policy. At scale, however, EPVFs suffer from unrestricted parameter growth and from inefficient exploration in the policy parameter space. To address these issues, we use massive parallelization with GPU-based simulators, large batch sizes, weight clipping, and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work, as well as specialized neural network architectures for efficiently handling weight-space features, which have not previously been used in the context of DRL.
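To make the ingredients named in the abstract concrete, the following is a minimal sketch of an EPVF-style loop on a toy problem; it is not the paper's implementation. Everything specific here is an assumption made for illustration: the quadratic `toy_return` stands in for simulator rollouts, the value function is a small hand-differentiated quadratic model rather than a neural network, and the perturbation rule (noise proportional to each parameter's magnitude) and all hyperparameters are hypothetical choices, since the abstract does not specify them.

```python
import random


def clip_weights(theta, limit):
    """Clamp each policy parameter to [-limit, limit] to prevent unbounded growth."""
    return [max(-limit, min(limit, t)) for t in theta]


def scaled_perturbation(theta, sigma, rng, floor=1e-2):
    """Add Gaussian noise whose std scales with |theta_i| (assumed scheme, not from the paper)."""
    return [t + sigma * max(abs(t), floor) * rng.gauss(0.0, 1.0) for t in theta]


def toy_return(theta):
    """Hypothetical stand-in for the episode return of the policy with parameters theta."""
    target = [0.5, -0.3]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def features(theta):
    """Quadratic features phi(theta) = [theta, theta^2]."""
    return list(theta) + [t * t for t in theta]


class ValueFunction:
    """Toy V(theta) = w . phi(theta) + b, conditioned directly on policy parameters."""

    def __init__(self, dim):
        self.dim = dim
        self.w = [0.0] * (2 * dim)
        self.b = 0.0

    def predict(self, theta):
        return sum(w * f for w, f in zip(self.w, features(theta))) + self.b

    def fit_step(self, batch, lr):
        """One gradient step on the mean squared error over (theta, return) pairs."""
        grad_w = [0.0] * len(self.w)
        grad_b = 0.0
        for theta, ret in batch:
            err = self.predict(theta) - ret
            for i, f in enumerate(features(theta)):
                grad_w[i] += err * f
            grad_b += err
        n = len(batch)
        self.w = [w - lr * g / n for w, g in zip(self.w, grad_w)]
        self.b -= lr * grad_b / n

    def grad_theta(self, theta):
        """Analytic dV/dtheta for the quadratic feature model."""
        d = self.dim
        return [self.w[i] + 2.0 * self.w[d + i] * theta[i] for i in range(d)]


def train(steps=200, batch_size=64, sigma=0.5, clip=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.9, 0.9]
    vf = ValueFunction(len(theta))
    for _ in range(steps):
        # Explore: evaluate a large batch of perturbed, clipped policies
        # (in the paper's setting these rollouts would run in parallel on a GPU simulator).
        batch = []
        for _ in range(batch_size):
            cand = clip_weights(scaled_perturbation(theta, sigma, rng), clip)
            batch.append((cand, toy_return(cand)))
        # Fit V(theta) to the observed returns.
        for _ in range(5):
            vf.fit_step(batch, lr=0.1)
        # Improve the policy by gradient ascent on V, then clip the weights.
        grad = vf.grad_theta(theta)
        theta = clip_weights([t + 0.05 * g for t, g in zip(theta, grad)], clip)
    return theta


if __name__ == "__main__":
    print(train())
```

The key property illustrated is that the policy update needs no action-level gradient: the policy parameters are the input to the value function, so ascent on V with respect to θ updates the policy directly, while clipping keeps θ inside the region where V has been trained.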