LG | AI | ML | Jun 27, 2025

Exploration Behavior of Untrained Policies

arXiv:2506.22566v3
Originality: Incremental advance
AI Analysis

The paper provides a framework for using policy initialization to understand exploration in early RL training. The advance is incremental, with limited practical impact.

The paper studies how deep neural network policy architectures shape exploration behavior in reinforcement learning before training. It shows, theoretically and empirically, that untrained policies can generate ballistic or diffusive trajectories in a toy model.

Exploration remains a fundamental challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. In this work, we study how the architecture of deep neural policies implicitly shapes exploration before training. We theoretically and empirically demonstrate strategies for generating ballistic or diffusive trajectories from untrained policies in a toy model. Using the theory of infinite-width networks and a continuous-time limit, we show that untrained policies return correlated actions and result in non-trivial state-visitation distributions. We discuss the distributions of the corresponding trajectories for a standard architecture, revealing insights into inductive biases for tackling exploration. Our results establish a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.
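To make the ballistic versus diffusive distinction concrete, the following is a minimal sketch and not the paper's actual setup. It assumes a toy dynamics x_{t+1} = x_t + dt * a_t with a deterministic tanh MLP policy a_t = f(x_t) at random initialization. All names (`init_mlp`, `policy`, `rollout`, `msd_exponent`) and all parameter values (`sigma_w`, `sigma_b`, `dims`, `dt`) are hypothetical choices made for illustration. Keeping one set of weights for the whole rollout gives correlated actions. Redrawing the weights at every step is one simple way to decorrelate them, and it is used here only for contrast; it may differ from the strategies the authors study. The scaling exponent of the mean squared displacement (MSD) is then estimated from a log-log fit.

```python
import numpy as np

def init_mlp(rng, dims, sigma_w=1.5, sigma_b=0.5):
    """Draw an untrained MLP: W ~ N(0, sigma_w^2 / fan_in), b ~ N(0, sigma_b^2)."""
    return [(rng.normal(0.0, sigma_w / np.sqrt(m), size=(m, n)),
             rng.normal(0.0, sigma_b, size=n))
            for m, n in zip(dims[:-1], dims[1:])]

def policy(params, x):
    """Deterministic action a = f(x): tanh hidden layers, linear output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b

def rollout(rng, steps, dt, dims, resample):
    """Integrate x_{t+1} = x_t + dt * a_t from x_0 = 0 (assumed toy dynamics).

    Returns the squared displacement |x_t - x_0|^2 after each step.
    """
    params = init_mlp(rng, dims)
    x = np.zeros(dims[0])
    sq_disp = np.empty(steps)
    for t in range(steps):
        if resample:
            # Illustrative assumption: fresh weights each step, so actions are uncorrelated.
            params = init_mlp(rng, dims)
        x = x + dt * policy(params, x)
        sq_disp[t] = x @ x
    return sq_disp

def msd_exponent(resample, trials=200, steps=50, dt=0.005, dims=(2, 64, 64, 2), seed=0):
    """Average squared displacement over independent draws and fit MSD ~ t^alpha."""
    rng = np.random.default_rng(seed)
    msd = np.mean([rollout(rng, steps, dt, dims, resample) for _ in range(trials)], axis=0)
    t = dt * np.arange(1, steps + 1)
    # Slope of log(MSD) against log(t) is the scaling exponent alpha.
    return np.polyfit(np.log(t), np.log(msd), 1)[0]

if __name__ == "__main__":
    print("fixed weights, alpha:", msd_exponent(resample=False))
    print("resampled weights, alpha:", msd_exponent(resample=True))
```

An exponent near 2 indicates ballistic motion (displacement grows linearly in time), while an exponent near 1 indicates diffusive motion. The short horizon keeps the state close to its starting point, so the fixed-weight policy acts almost like a constant drift. Over longer horizons the random vector field can bend trajectories and the exponent may drift.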

