SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, Chunyan Miao

arXiv:2601.22385v10.6h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the challenge of noisy and varied preference data for LLM alignment, offering an incremental improvement over standard DPO.

The paper tackles heterogeneous preference pairs in Direct Preference Optimization (DPO) by introducing SP2DPO, a method that sets an instance-specific temperature for each pair from semantic annotations produced by teacher LLMs. On AlpacaEval 2.0 it is competitive with a tuned global-beta DPO baseline and improves length-controlled win rate on two of four backbones.

Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
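To make the "standard DPO with beta set per pair" idea concrete, here is a minimal sketch in plain Python. It assumes the usual DPO objective, -log sigmoid(beta * margin), where the margin is the policy-versus-reference log-probability gap between the chosen and rejected responses, and simply swaps the global beta for a per-pair list. The function and argument names are hypothetical, and the sketch does not cover how the paper derives beta_i from the category, magnitude, and confidence annotations.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def per_pair_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp, betas):
    """Mean DPO loss with one temperature per preference pair.

    All arguments are equal-length sequences of floats. The log-probabilities
    are sequence-level sums; betas[i] is the precomputed beta_i for pair i
    (hypothetical input, assumed to come from an offline artifact).
    """
    n = len(betas)
    inputs = (policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp)
    if n == 0 or any(len(seq) != n for seq in inputs):
        raise ValueError("all inputs must be non-empty and of equal length")

    total = 0.0
    for pc, pr, rc, rr, beta in zip(*inputs, betas):
        # Implicit reward margin: how much more the policy prefers the chosen
        # response over the rejected one, relative to the reference model.
        margin = (pc - rc) - (pr - rr)
        total += -log_sigmoid(beta * margin)
    return total / n


# Two pairs with the same margin but different per-pair temperatures.
loss = per_pair_dpo_loss(
    policy_chosen_logp=[-1.0, -1.0],
    policy_rejected_logp=[-3.0, -3.0],
    ref_chosen_logp=[-2.0, -2.0],
    ref_rejected_logp=[-2.0, -2.0],
    betas=[0.1, 0.5],
)
print(f"mean per-pair DPO loss: {loss:.4f}")
```

Passing the same value for every entry of `betas` recovers the ordinary global-beta DPO loss, which is why the training loop itself needs no changes under this reading of the abstract.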
