LG · Nov 11, 2025

PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

Zhihao Lin, Lin Wu, Zhen Tian, Jianglin Lan

arXiv:2511.08241v14.1 · h-index: 11

Originality: Highly original

AI Analysis

This work addresses inefficient exploration in reinforcement learning, offering a method that improves policy gradient algorithms on control tasks with both continuous and discrete action spaces.

The paper tackles the challenge of exploration in reinforcement learning by introducing PrefPoE, a framework that uses advantage-guided preference fusion to balance exploration and exploitation, resulting in significant performance improvements such as +321% on HalfCheetah-v4 and +276% on LunarLander-v2.

Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
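To make the fusion step in the abstract concrete, here is a minimal sketch of generic product-of-experts fusion for the two action-space types it mentions. It is not the authors' implementation: the abstract does not give PrefPoE's exact formulation, so the function names (`poe_categorical`, `poe_gaussian`), the equal weighting of the two experts, and the diagonal-Gaussian assumption are all illustrative choices. The sketch uses only the standard PoE identities: multiplying categorical distributions adds their logits, and multiplying Gaussians adds their precisions.

```python
import numpy as np


def poe_categorical(policy_logits, pref_logits):
    """Fuse two categorical experts (hypothetical helper, discrete actions).

    The product of two categorical distributions is proportional to the
    elementwise product of their probabilities, i.e. the sum of their logits.
    """
    fused = np.asarray(policy_logits, dtype=float) + np.asarray(pref_logits, dtype=float)
    fused = fused - fused.max()  # shift for numerical stability before exp
    probs = np.exp(fused)
    return probs / probs.sum()   # renormalize to a valid distribution


def poe_gaussian(mu_policy, std_policy, mu_pref, std_pref):
    """Fuse two diagonal Gaussian experts (hypothetical helper, continuous actions).

    The product of two Gaussians is Gaussian with precision equal to the sum
    of the experts' precisions and a precision-weighted mean.
    """
    prec_policy = 1.0 / np.square(std_policy)
    prec_pref = 1.0 / np.square(std_pref)
    prec = prec_policy + prec_pref
    mu = (prec_policy * mu_policy + prec_pref * mu_pref) / prec
    std = np.sqrt(1.0 / prec)
    return mu, std


if __name__ == "__main__":
    # Discrete case: both experts mildly prefer action 0, so the fusion sharpens it.
    print(poe_categorical([1.0, 0.0], [1.0, 0.0]))

    # Continuous case: equal-variance experts centered at 0 and 2.
    mu, std = poe_gaussian(np.array([0.0]), np.array([1.0]),
                           np.array([2.0]), np.array([1.0]))
    print(mu, std)
```

In the Gaussian case the fused standard deviation is never larger than either expert's, and in the categorical case actions that both experts disfavor are suppressed multiplicatively. That concentration is one plausible reading of the "soft trust region" described above, though the abstract does not spell out the mechanism.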
