CL · Oct 15, 2025

Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

Hao Wang, Linlong Xu, Heng Liu, Yangyang Liu, Xiaohu Zhao, Bo Zeng, Liangying Shao, Longyue Wang, Weihua Luo, Kaifu Zhang

arXiv:2510.13434v1 · 4.91 citations · h-index: 13

Originality: Incremental advance

AI Analysis

This paper addresses inefficiencies in aligning LLMs to human preferences for machine translation and offers a more robust, data-efficient approach. The advance is incremental because it builds on the existing DPO framework.

The paper tackles two problems in Direct Preference Optimization for machine translation: flawed reward signals and inefficient data utilization. It introduces M^2PO, which combines a multi-perspective reward engine with a multi-pair construction strategy. The authors report that M^2PO substantially outperforms existing preference optimization methods and is competitive with proprietary LLMs on the WMT21-22 benchmarks.

Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
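The two ideas in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the reward formulas, so the `Candidate` fields, the fixed mixing weight `alpha` (the paper describes an adaptive fusion), the `penalty_weight`, and the `min_gap` threshold are all hypothetical choices made for illustration. The sketch scores each candidate from two perspectives (a fused quality score minus a hallucination penalty) and then builds preference pairs from the whole candidate pool instead of keeping only the best and worst.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    text: str
    external_quality: float  # hypothetical QE-model score in [0, 1]
    model_quality: float     # hypothetical score from the policy itself, in [0, 1]
    hallucination: float     # hypothetical hallucination penalty in [0, 1]


def combined_reward(c, alpha=0.5, penalty_weight=1.0):
    """Fuse the two quality views, then subtract the hallucination penalty.

    alpha is a fixed weight here (an assumption); the paper describes
    a fusion that adapts during training.
    """
    quality = alpha * c.external_quality + (1.0 - alpha) * c.model_quality
    return quality - penalty_weight * c.hallucination


def build_pairs(candidates, alpha=0.5, penalty_weight=1.0, min_gap=0.05):
    """Return every (winner, loser) pair whose reward gap is at least min_gap."""
    scored = [(combined_reward(c, alpha, penalty_weight), c) for c in candidates]
    pairs = []
    for (ra, a), (rb, b) in combinations(scored, 2):
        if abs(ra - rb) < min_gap:
            continue  # rewards too close to rank reliably; skip this pair
        pairs.append((a, b) if ra > rb else (b, a))
    return pairs


# Toy pool of four candidate translations (made-up scores).
candidates = [
    Candidate("A", 0.90, 0.80, 0.0),
    Candidate("B", 0.95, 0.90, 0.5),  # fluent but hallucinated
    Candidate("C", 0.60, 0.60, 0.0),
    Candidate("D", 0.62, 0.60, 0.0),
]
pairs = build_pairs(candidates)
for winner, loser in pairs:
    print(f"{winner.text} preferred over {loser.text}")
```

In this toy pool, candidate B has the highest raw quality scores but loses every comparison once the hallucination penalty is applied, and the near-tie between C and D is dropped. A single best-versus-worst selection would yield one pair from these four candidates, while the multi-pair construction yields five.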
