CL · AI · Oct 25, 2024

2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

arXiv:2410.19720v1 · 12 citations · h-index: 13 · NAACL
Originality: Incremental advance
AI Analysis

This work targets researchers and practitioners who need to align Large Language Models more closely with complex human preferences. It is an incremental advance over existing DPO methods.

The paper tackles a limitation of Direct Preference Optimization (DPO): it cannot represent multi-dimensional human preferences. The authors extend DPO supervision to two dimensions, segments and aspects, using a new dataset called HelpSteer-2D, and report improved performance over scalar and 1-dimensional methods on popular benchmarks.

Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
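To make the two supervision dimensions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a response is split into sentence segments, that each segment carries one score per aspect (a segments x aspects matrix), and that those scores are collapsed into per-segment weights that scale a DPO-style margin. The aspect names, the aggregation rule, the function names, and the exact form of the loss are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
import re

# Hypothetical aspect names; the abstract only says "several criteria".
ASPECTS = ("helpfulness", "correctness", "clarity")


def split_into_segments(response):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def segment_weights(score_matrix, aspect_weights):
    """Collapse a segments x aspects score matrix into one weight per segment.

    Assumption: a plain weighted sum over aspects; other aggregations are possible.
    """
    return [sum(w * s for w, s in zip(aspect_weights, row)) for row in score_matrix]


def weighted_dpo_loss(chosen_log_ratios, chosen_weights,
                      rejected_log_ratios, rejected_weights, beta=0.1):
    """Segment-weighted variant of the DPO loss (illustrative assumption).

    Each *_log_ratios entry is the summed log(pi_theta / pi_ref) over one segment.
    Returns -log(sigmoid(margin)), computed in a numerically stable way.
    """
    chosen = sum(w * r for w, r in zip(chosen_weights, chosen_log_ratios))
    rejected = sum(w * r for w, r in zip(rejected_weights, rejected_log_ratios))
    margin = beta * (chosen - rejected)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    segments = split_into_segments("Paris is the capital. It is in France!")
    scores = [[4, 4, 3], [3, 4, 4]]  # made-up scores, one row per segment
    weights = segment_weights(scores, aspect_weights=[0.5, 0.25, 0.25])
    print(segments, weights)
    print(weighted_dpo_loss([0.2, 0.1], weights, [-0.1, -0.3], [1.0, 1.0]))
```

With uniform weights and a single segment per response, this sketch reduces to the standard scalar DPO loss, which is one way to see what the second dimension adds.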

