Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
For practitioners fine-tuning LLMs with RL, this method offers a way to enhance exploration diversity while maintaining task accuracy, though it is an incremental improvement over existing entropy regularization techniques.
Policy Split is a dual-mode entropy regularization method for RL fine-tuning of LLMs that separates the policy into a normal mode and a high-entropy mode, improving exploration without sacrificing accuracy. It consistently outperforms entropy-guided RL baselines across model sizes on general and creative tasks.
To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that uses a high-entropy prompt to bifurcate the policy into a normal mode and a high-entropy mode. The two modes share model parameters but undergo collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
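The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of how a prompt-selected mode could receive its own entropy term while sharing one set of parameters. The prompt wording, the `ENTROPY_COEF` values, and the function names `build_prompt`, `token_entropy`, and `regularized_loss` are all hypothetical, and the additive entropy bonus is a standard stand-in rather than the method's actual objective.

```python
import math

# Hypothetical prompt prefix that switches the shared policy into its
# high-entropy mode; the actual wording used by the method is not given.
HIGH_ENTROPY_PROMPT = "Explore unusual approaches before answering.\n"

# Assumed per-mode entropy coefficients (illustrative values only):
# the normal mode gets no exploration bonus, the high-entropy mode does.
ENTROPY_COEF = {"normal": 0.0, "high_entropy": 0.01}


def build_prompt(task: str, mode: str) -> str:
    """Return the task prompt, prefixed only in the high-entropy mode."""
    if mode == "high_entropy":
        return HIGH_ENTROPY_PROMPT + task
    return task


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(policy_loss: float,
                     token_logits: list[list[float]],
                     mode: str) -> float:
    """Subtract a mode-specific entropy bonus from a policy-gradient loss."""
    mean_entropy = sum(token_entropy(l) for l in token_logits) / len(token_logits)
    return policy_loss - ENTROPY_COEF[mode] * mean_entropy
```

In this sketch both modes would be evaluated by the same model; only the prompt prefix and the entropy coefficient differ, which is one simple way to read "sharing model parameters" with "distinct objectives".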