CL · AI · Jul 21, 2024

A Practical Analysis of Human Alignment with *PO

arXiv:2407.15229v2 · 17 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses a practical challenge for AI practitioners, the hyperparameter sensitivity of human alignment methods, and offers an incremental improvement with LN-DPO.

The paper analyzes how robust state-of-the-art preference optimization (*PO) methods for human alignment are to hyperparameter variations in an out-of-distribution scenario. It finds that the methods perform similarly at their peak but diverge significantly as hyperparameters move away from the best setting. It also introduces LN-DPO, a length-normalized version of DPO that improves stability and reduces average response length.

At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). Prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we examine the robustness of existing state-of-the-art methods to varying hyperparameters in a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment. Our goal is to empirically find the method that increases the likelihood of achieving better results through the lens of various metrics, such as KL divergence and response length. We also introduce LN-DPO, a simple length-normalized version of DPO that is more stable across hyperparameters, effectively reduces the average response length, and improves performance. Our analysis of state-of-the-art reference-free (i.e., SimPO) and reference-dependent (i.e., DPO and LN-DPO) methods reveals that they perform similarly at their peak (i.e., best possible scenario). However, we uncover that the pattern of change in performance greatly varies as we move away from the best possible scenario.
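To make "length-normalized" concrete, the following is a minimal sketch for a single preference pair. The DPO loss is the standard formulation. The length-normalized variant, which divides each policy-to-reference log-ratio by the response's token count, is an assumption about how LN-DPO is defined, since the text above does not state the formula. The function names, argument names, and default `beta` are illustrative only.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are summed sequence log-probabilities of the chosen (w)
    and rejected (l) responses under the policy and reference models.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)


def ln_dpo_loss(policy_w, ref_w, policy_l, ref_l, len_w, len_l, beta=0.1):
    """Length-normalized variant (ASSUMED form, not taken from the text).

    Each log-ratio is divided by the token count of its response, so a
    long response cannot enlarge the margin through length alone.
    """
    margin = beta * ((policy_w - ref_w) / len_w - (policy_l - ref_l) / len_l)
    return -log_sigmoid(margin)
```

Under this assumed form, the two losses coincide when both responses have length 1, and for longer responses the normalized margin shrinks in proportion to length. That is one plausible mechanism for the reduced average response length the abstract reports.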

