LG, CL · Sep 24, 2025

Failure Modes of Maximum Entropy RLHF

arXiv:2509.20265v1
h-index: 14
Originality: Synthesis-oriented
AI Analysis

This paper identifies failure modes of reference-free approaches in online preference learning; the contribution is incremental but relevant for RLHF practitioners.

The paper investigates whether Maximum Entropy Reinforcement Learning can achieve strong results in online RLHF settings, finding that it consistently exhibits overoptimization and unstable KL dynamics, unlike KL-constrained methods that maintain stable training.

In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning with length-normalized temperature, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.
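
For context, a minimal sketch of the SimPO loss that the abstract refers to, assuming the standard form from the SimPO paper: a logistic loss on the difference of length-normalized log-likelihoods, minus a target margin. The term `beta * logp / length` is the implicit reward that the abstract reads as Maximum Entropy RL with a length-normalized temperature. The function name, argument names, and default values of `beta` and `gamma` are illustrative assumptions, not taken from this paper.

```python
import math


def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """Per-pair SimPO loss (sketch; names and defaults are illustrative).

    logp_* are summed token log-probabilities of each response under the
    policy; len_* are the response lengths in tokens. No reference model
    is involved, which is what makes the method reference-free.
    """
    # Length-normalized implicit rewards: beta times the average token log-prob.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected

    # Reward gap, reduced by the target margin gamma.
    margin = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Chosen response has a higher average log-prob than the rejected one.
    print(simpo_loss(-12.0, -30.0, len_chosen=10, len_rejected=12))
```

The loss falls as the chosen response's average log-probability rises relative to the rejected one. Nothing in it ties the policy to a reference model, which is the property the abstract contrasts with KL-constrained methods.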

