LGAI · Feb 17, 2025

Massively Scaling Explicit Policy-conditioned Value Functions

arXiv:2502.11949v1
h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses scaling bottlenecks in reinforcement learning for continuous-control domains, representing an incremental improvement with novel architectural and exploration techniques.

The paper tackles the challenge of scaling Explicit Policy-Conditioned Value Functions (EPVFs) for continuous-control tasks by addressing issues like parameter growth and exploration, resulting in competitive performance with state-of-the-art methods like PPO and SAC on complex tasks such as a custom Ant environment.

We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, big batch sizes, weight clipping and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures to efficiently handle weight-space features, which have not been used in the context of DRL before.
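To make the abstract's mechanics concrete, here is a minimal sketch of the three ingredients it names: gradient ascent on V(θ) with respect to the policy parameters, weight clipping, and scaled perturbations. It is not the paper's implementation. The quadratic `toy_value` stands in for a learned value network, and `TARGET`, the clipping limit, the learning rate, and the rule that scales noise by each weight's magnitude are all assumptions chosen for illustration.

```python
import random

# Hypothetical optimum of the toy value function (assumption, not from the paper).
TARGET = [0.5, -2.0, 3.0]


def clip_weights(theta, limit):
    """Clamp every parameter to [-limit, limit] to prevent unbounded growth."""
    return [max(-limit, min(limit, w)) for w in theta]


def perturb(theta, scale, rng):
    """Add Gaussian noise whose std grows with each weight's magnitude.

    This particular scaling rule is an assumption for illustration.
    """
    return [w + rng.gauss(0.0, scale * (abs(w) + 1e-3)) for w in theta]


def toy_value(theta):
    """Stand-in for a learned V(theta): a concave quadratic peaking at TARGET."""
    return -sum((w - t) ** 2 for w, t in zip(theta, TARGET))


def toy_value_grad(theta):
    """Analytic gradient of toy_value with respect to theta."""
    return [-2.0 * (w - t) for w, t in zip(theta, TARGET)]


def improve_policy(theta, lr, limit, steps):
    """Gradient ascent on V(theta), clipping the parameters after every step."""
    for _ in range(steps):
        grad = toy_value_grad(theta)
        theta = [w + lr * g for w, g in zip(theta, grad)]
        theta = clip_weights(theta, limit)
    return theta


# Ascend from zero; coordinates whose optimum lies outside the limit saturate.
final = improve_policy([0.0, 0.0, 0.0], lr=0.1, limit=1.0, steps=200)

# Exploration candidates around the improved parameters.
rng = random.Random(0)
candidates = [perturb(final, scale=0.1, rng=rng) for _ in range(4)]
```

In this toy setting the first coordinate converges to its optimum of 0.5, while the other two stop at the clipping bounds of -1.0 and 1.0. In an actual EPVF setup the candidates would be evaluated in the simulator and the resulting (θ, return) pairs used to fit V(θ).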
