CL, AI, LG | May 24, 2025

Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

arXiv:2505.18720v1 | 3 citations | h-index: 8 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in aligning large language models with human preferences, though it appears incremental because it builds on DPO.

The paper tackles the problem of suboptimal preference optimization in Direct Preference Optimization (DPO) by proposing OTPO, an optimal transport-based token weighting scheme that emphasizes meaningful tokens, leading to improved instruction-following ability in experiments.

Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence the DPO loss. To address this limitation, we propose an Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
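
The contrast the abstract draws between uniform and weighted token contributions can be illustrated with a small sketch. This is not the authors' implementation (the linked repository is the reference for that). The function names (`cosine_cost`, `unbalanced_sinkhorn`, `weights_from_plan`, `weighted_dpo_loss`), the parameters `eps`, `rho` and `beta`, the use of cosine distance between token representations, the choice of an unbalanced Sinkhorn solver, and the way weights are read off the transport plan are all assumptions made for illustration. With all weights equal to one, the loss reduces to the standard DPO loss on summed token log-ratios.

```python
import numpy as np

def cosine_cost(h_c, h_r):
    """Pairwise cosine distance between chosen and rejected token vectors."""
    h_c = h_c / np.linalg.norm(h_c, axis=1, keepdims=True)
    h_r = h_r / np.linalg.norm(h_r, axis=1, keepdims=True)
    return 1.0 - h_c @ h_r.T

def unbalanced_sinkhorn(cost, eps=0.1, rho=1.0, n_iters=200):
    """Entropic OT plan with KL-relaxed marginals (assumed solver choice).

    Relaxing the marginals lets the row and column sums of the plan differ
    across tokens, which a balanced plan with uniform marginals would not.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on chosen tokens
    b = np.full(m, 1.0 / m)          # uniform mass on rejected tokens
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    power = rho / (rho + eps)        # damping from the KL marginal penalty
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

def weights_from_plan(plan):
    """Hypothetical rule: token weight = transported mass, rescaled to mean 1."""
    row, col = plan.sum(axis=1), plan.sum(axis=0)
    w_c = row * len(row) / row.sum()
    w_r = col * len(col) / col.sum()
    return w_c, w_r

def weighted_dpo_loss(logp_c, ref_logp_c, logp_r, ref_logp_r,
                      w_c=None, w_r=None, beta=0.1):
    """DPO loss on per-token log-ratios with optional token weights.

    Inputs are per-token log-probabilities under the policy and the
    reference model. Weights of one recover the standard DPO loss.
    """
    ratio_c = logp_c - ref_logp_c
    ratio_r = logp_r - ref_logp_r
    if w_c is None:
        w_c = np.ones_like(ratio_c)
    if w_r is None:
        w_r = np.ones_like(ratio_r)
    margin = beta * (np.sum(w_c * ratio_c) - np.sum(w_r * ratio_r))
    return np.logaddexp(0.0, -margin)   # equals -log(sigmoid(margin))
```

In use, one would compute a cost matrix from token representations of the two responses, pass it through `unbalanced_sinkhorn`, derive `w_c` and `w_r` with `weights_from_plan`, and feed them to `weighted_dpo_loss`. Tokens with no close counterpart in the other response receive less transported mass and therefore a smaller weight under this particular rule.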

Code Implementations: 1 repo