AI · Jun 8

A Regret Minimization Framework on Preference Learning in Large Language Models

Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee

arXiv:2606.09124v19.7

Predicted impact: top 47% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For researchers and practitioners training LLMs with human feedback, RePO offers a more human-aligned alternative to reward maximization, though improvements are incremental over existing methods.

The paper introduces RePO, a regret minimization framework for RLHF that models human preferences as behavior-conditioned assessments of relative suboptimality, and reports consistent performance gains on mathematical reasoning benchmarks and human preference datasets.

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.
