LGJun 24, 2025

Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment

arXiv:2506.19780v53 citationsh-index: 3
Originality Highly original
AI Analysis

This enables more efficient and controllable alignment for compute-constrained settings like academic labs or on-device assistants.

The paper tackles the problem of aligning small language models with human preferences by proposing Multi-Preference Lambda-weighted Listwise DPO, which outperforms standard DPO on alignment benchmarks for 1B-2B scale models while requiring only 20GB of GPU memory.

Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences. Reinforcement Learning from Human Feedback (RLHF) addresses this by optimizing models toward human preferences using a learned reward function and reinforcement learning, yielding improved alignment but suffering from high computational cost and instability. Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs, reducing training overhead while achieving competitive performance. However, it assumes fixed, single-dimensional preferences and only supports pairwise supervision. To address these limitations, we propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback and flexibly balance multiple goals such as helpfulness, honesty, and fluency. Our method models full-ranked preference distributions rather than binary comparisons, enabling more informative learning signals. The lambda vector controls the relative importance of different alignment goals, allowing the model to generalize across diverse human objectives. During inference, lambda can be adjusted without retraining, providing controllable alignment behavior for downstream use. We also introduce a learned scheduler that dynamically samples performant lambda configurations to improve robustness. Notably, our method requires only 20GB of GPU memory for training, making it suitable for compute-constrained settings such as academic labs, educational tools, or on-device assistants. Experiments on 1B-2B scale models show that our method consistently outperforms standard DPO on alignment benchmarks while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes