LG · Jun 16, 2025

Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization

arXiv:2506.13345v29.43 citations · h-index: 1 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses exploration robustness for reinforcement learning practitioners, offering a method that works across different reward types without manual adjustment. It appears incremental, since it builds on existing TD-error maximization ideas.

The paper tackles exploration in reinforcement learning across diverse reward settings. It proposes Stable Error-seeking Exploration (SEE), which performs robustly in dense, sparse, and exploration-adverse settings without hyperparameter tuning, as demonstrated with a Soft-Actor Critic agent on a variety of tasks.

Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this endeavor, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft-Actor Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
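To make the core idea concrete, here is a minimal tabular sketch of "maximizing the TD-error as a separate objective". It is not SEE itself: the abstract does not spell out the three stabilizing design choices, so none of them appear here. The two-critic layout, the use of the absolute TD-error as the exploration reward, and the names `td_error`, `update`, `q_task`, `q_explore`, `GAMMA`, and `ALPHA` are all assumptions made for illustration.

```python
from collections import defaultdict

GAMMA = 0.99  # discount factor (assumed value)
ALPHA = 0.1   # learning rate (assumed value)


def td_error(q, s, a, r, s_next, done, actions, gamma=GAMMA):
    """One-step Q-learning TD-error: r + gamma * max_b Q(s', b) - Q(s, a)."""
    bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
    return r + gamma * bootstrap - q[(s, a)]


def update(q_task, q_explore, transition, actions, alpha=ALPHA):
    """Update a task critic and a separate exploration critic on one transition.

    The task critic learns from the environment reward as usual. The
    exploration critic is trained on a different reward: the magnitude of
    the task critic's TD-error (an assumed choice, not taken from the paper).
    """
    s, a, r, s_next, done = transition

    # Task objective: ordinary Q-learning, untouched by the exploration signal.
    delta_task = td_error(q_task, s, a, r, s_next, done, actions)
    q_task[(s, a)] += alpha * delta_task

    # Exploration objective: seek transitions the task critic predicts poorly.
    r_explore = abs(delta_task)
    delta_explore = td_error(q_explore, s, a, r_explore, s_next, done, actions)
    q_explore[(s, a)] += alpha * delta_explore

    return delta_task, r_explore
```

A behavior policy could then act greedily with respect to `q_explore` while `q_task` is learned off-policy from the same transitions. Because the exploration reward shrinks as `q_task` improves, the exploration critic chases a non-stationary target, which is one of the instabilities the abstract says SEE is designed to mitigate.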
