LG · AI · ML · May 23, 2024

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

arXiv:2405.14953v5 · 18 citations · h-index: 8 · ICLR
Originality: Incremental advance
AI Analysis

This paper addresses a limitation of preference optimization for LLM fine-tuning, offering an incremental improvement with practical gains.

The paper tackles DPO's inability to capture diverse human preferences by introducing MallowsPO with a dispersion index, showing it enhances DPO performance across benchmarks and boosts Llama3-Instruct fine-tuning by nearly 2% LC win rate.

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, is that it cannot characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop in this paper a new approach, MallowsPO. A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preferences in response to prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, and are thus unified with MallowsPO. More importantly, we demonstrate empirically how to use this dispersion index to enhance the performance of DPO on a broad array of benchmark tasks, from synthetic bandit selection to controllable generation and dialogue, while maintaining strong generalization capabilities. MallowsPO is also compatible with other SOTA offline preference optimization methods, adding nearly 2% to the LC win rate when used as a plugin for fine-tuning Llama3-Instruct.
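
The relationship between DPO and a prompt-dependent dispersion term can be illustrated with a minimal sketch. The abstract does not give the functional form, so the code below assumes, purely for illustration, that the dispersion index enters as a per-prompt multiplier on the standard DPO log-ratio margin. The names `scaled_dpo_loss` and `scale` are hypothetical and are not taken from the paper; the paper's actual objective and its way of estimating dispersion may differ.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def scaled_dpo_loss(
    logp_w: float,      # policy log-prob of the preferred response
    logp_l: float,      # policy log-prob of the rejected response
    ref_logp_w: float,  # reference-model log-prob of the preferred response
    ref_logp_l: float,  # reference-model log-prob of the rejected response
    beta: float = 0.1,
    scale: float = 1.0,  # ASSUMPTION: hypothetical per-prompt dispersion weight
) -> float:
    """Per-example DPO loss with an illustrative per-prompt scale.

    With scale == 1.0 this is the standard DPO loss
    -log sigmoid(beta * (log-ratio of winner - log-ratio of loser)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(scale * margin)


# Standard DPO is recovered when the scale is fixed to 1 for every prompt.
plain = scaled_dpo_loss(-1.0, -2.0, -1.5, -1.5)
# A larger scale sharpens the preference signal for that prompt.
sharp = scaled_dpo_loss(-1.0, -2.0, -1.5, -1.5, scale=2.0)
```

Under this assumption, fixing `scale` to a constant for every prompt recovers plain DPO, which mirrors the abstract's claim that existing DPO models are special cases; a prompt-dependent `scale` lets prompts with more or less agreement among annotators contribute differently to the gradient.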

