LG | AI | May 13, 2024

Adaptive Exploration for Data-Efficient General Value Function Evaluations

Arushi Jain, Josiah P. Hanna, Doina Precup

arXiv:2405.07838v2 | 6.43 citations | h-index: 51 | Has Code | NIPS

Originality: Incremental advance

AI Analysis

This work targets data efficiency for reinforcement learning researchers and practitioners who learn multiple GVFs in parallel. The advance is incremental because it builds on existing variance estimation methods.

The paper tackles data efficiency when learning multiple General Value Functions (GVFs) in parallel. It introduces GVFExplorer, which adaptively learns a single behavior policy that minimizes the total variance in returns across GVFs, reducing both the number of environment interactions and the prediction error in tabular and nonlinear settings such as MuJoCo environments.

General Value Functions (GVFs) (Sutton et al., 2011) represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique reward. Existing methods relying on fixed behavior policies or pre-collected data often face data efficiency issues when learning multiple GVFs in parallel using off-policy methods. To address this, we introduce GVFExplorer, which adaptively learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel. Our method optimizes the behavior policy by minimizing the total variance in return across GVFs, thereby reducing the required environmental interactions. We use an existing temporal-difference-style variance estimator to approximate the return variance. We prove that each behavior policy update decreases the overall mean squared error in GVF predictions. We empirically show our method's performance in tabular and nonlinear function approximation settings, including MuJoCo environments, with stationary and non-stationary reward signals, optimizing data usage and reducing prediction errors across multiple GVFs.
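The abstract does not spell out the behavior policy update, so the following is only a one-step (bandit) sketch of the underlying idea, not the paper's algorithm. It assumes each target policy is evaluated with ordinary importance sampling, in which case the summed second moment of the weighted rewards is sum_i sum_a pi_i(a)^2 m_i(a) / mu(a), and the minimizing behavior policy is mu(a) proportional to sqrt(sum_i pi_i(a)^2 m_i(a)). The function names, the two target policies, and the second-moment values m_i(a) = E[R_i^2 | a] are all hypothetical and chosen for illustration.

```python
import math


def variance_minimizing_behavior(target_policies, second_moments):
    """One-step sketch: mu(a) proportional to sqrt(sum_i pi_i(a)^2 * m_i(a)).

    target_policies[i][a] is pi_i(a); second_moments[i][a] is E[R_i^2 | a].
    Both are hypothetical inputs, not quantities taken from the paper.
    """
    n_actions = len(target_policies[0])
    scores = []
    for a in range(n_actions):
        c = sum(pi[a] ** 2 * m[a]
                for pi, m in zip(target_policies, second_moments))
        scores.append(math.sqrt(c))
    total = sum(scores)
    return [s / total for s in scores]


def summed_second_moment(mu, target_policies, second_moments):
    """Sum over targets of E_mu[(pi_i(a) / mu(a) * R_i)^2].

    The means of the importance-weighted estimates do not depend on mu,
    so minimizing this quantity also minimizes the summed variance.
    """
    return sum(pi[a] ** 2 * m[a] / mu[a]
               for pi, m in zip(target_policies, second_moments)
               for a in range(len(mu)))


# Two hypothetical target policies over three actions.
target_policies = [[0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8]]
# Hypothetical second moments of each target's reward, per action.
second_moments = [[1.0, 4.0, 1.0],
                  [1.0, 4.0, 9.0]]

mu_star = variance_minimizing_behavior(target_policies, second_moments)
uniform = [1.0 / 3.0] * 3

print("behavior policy:", mu_star)
print("summed second moment (mu_star):",
      summed_second_moment(mu_star, target_policies, second_moments))
print("summed second moment (uniform):",
      summed_second_moment(uniform, target_policies, second_moments))
```

In this sketch the single behavior policy puts more probability on actions that are both likely under some target policy and high in reward magnitude, which is the intuition behind collecting data where the return variance is largest. The sequential setting described in the abstract would replace the fixed second moments with a learned, temporal-difference-style variance estimate per state-action pair.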
