LGJun 24, 2025

Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment

Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao

arXiv:2506.19780v513.03 citationsh-index: 3Has Code

Originality Highly original

AI Analysis

This enables more efficient and controllable alignment for compute-constrained settings like academic labs or on-device assistants.

The paper tackles the problem of aligning small language models with human preferences by proposing Multi-Preference Lambda-weighted Listwise DPO, which outperforms standard DPO on alignment benchmarks for 1B-2B scale models while requiring only 20GB of GPU memory.

Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Reinforcement Learning from Human Feedback (RLHF) addresses this by optimizing models toward human preferences using a learned reward function and reinforcement learning, yielding improved alignment but suffering from high computational cost and instability. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs, reducing training overhead while achieving competitive performance. However, it assumes fixed, single-dimensional preferences and only supports pairwise supervision. To address these limitations, we propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback and flexibly balance multiple goals such as helpfulness, honesty, and fluency. Our method models full-ranked preference distributions rather than binary comparisons, enabling more informative learning signals. The lambda vector controls the relative importance of different alignment goals, allowing the model to generalize across diverse human objectives. During inference, lambda can be adjusted without retraining, providing controllable alignment behavior for downstream use. We also introduce a learned scheduler that dynamically samples performant lambda configurations to improve robustness. Notably, our method requires only 20GB of GPU memory for training, making it suitable for compute-constrained settings such as academic labs, educational tools, or on-device assistants. Experiments on 1B-2B scale models show that our method consistently outperforms standard DPO on alignment benchmarks while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.

View on arXiv PDF Code

Similar