LG | Jun 10, 2025

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

arXiv:2506.08681v2 | 3 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a critical problem for developers of large language models by improving alignment stability, though the advance is incremental because it builds on existing DAAs.

The paper tackles reward over-optimization in Direct Alignment Algorithms such as DPO, where performance degrades as the model drifts from the reference policy. It proposes an importance-sampling approach (IS-DAAs) that mitigates this issue and, in the reported experiments, performs better than other methods.

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs. This approach, called IS-DAAs, multiplies the DAA objective by an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high-variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem. Our implementations are provided publicly at this link.
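
The idea in the abstract (weight the DAA objective by an importance ratio, clipped to a maximum) can be illustrated with a minimal Python sketch for a single preference pair. This is not the authors' implementation. The function names, the `max_ratio` parameter, and in particular the exact form of the ratio (current-policy probability of the pair divided by its reference-policy probability) are assumptions made for illustration; the abstract only says that the ratio accounts for the reference policy distribution and is clipped.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def clipped_importance_ratio(logp_w, logp_l, ref_logp_w, ref_logp_l, max_ratio):
    """ASSUMED ratio form: pi_theta(y_w) * pi_theta(y_l) / (pi_ref(y_w) * pi_ref(y_l)).

    The ratio is clipped from above at max_ratio to limit variance.
    The comparison is done in log space so a large log-ratio cannot overflow exp().
    """
    log_ratio = (logp_w - ref_logp_w) + (logp_l - ref_logp_l)
    if log_ratio >= math.log(max_ratio):
        return max_ratio
    return math.exp(log_ratio)


def is_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, max_ratio=1.0):
    """Importance-weighted DPO loss: clipped ratio times the DPO loss.

    In an autodiff framework the weight would typically be treated as a
    constant (no gradient through it); that too is an assumption here.
    """
    weight = clipped_importance_ratio(
        logp_w, logp_l, ref_logp_w, ref_logp_l, max_ratio
    )
    return weight * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
```

Under this assumed form, a pair whose policy probability has fallen far below its reference probability receives a small weight, so it contributes little to the loss, while the cap keeps any single pair from dominating the update.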

Code Implementations: 1 repo
