CL · Apr 18, 2025

Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling

arXiv:2504.13630v1 · 4 citations · h-index: 17 · EMNLP
Originality: Highly original
AI Analysis

This work addresses the challenge of reliable evaluation in machine translation for researchers and practitioners, representing a strong incremental improvement over existing methods.

The authors address noisy and inconsistent human ratings in machine translation evaluation by proposing ReMedy, a framework that learns relative translation quality from pairwise preference data. Across the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance, with ReMedy-9B surpassing larger models such as MetricX-13B and PaLM-540B.

A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
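The idea of learning from pairwise preferences rather than regressing on raw ratings can be illustrated with a minimal sketch. It assumes a Bradley-Terry style ranking loss, which is a common choice in reward modeling; the abstract does not state ReMedy's exact objective, and the function names `build_preference_pairs` and `pairwise_preference_loss` and the `min_gap` parameter are hypothetical, introduced here only for illustration.

```python
import math
from itertools import combinations


def build_preference_pairs(rated, min_gap=0.0):
    """Turn human ratings for one source segment into (preferred, rejected) pairs.

    rated: list of (translation, human_score) tuples.
    min_gap: hypothetical threshold; pairs whose scores differ by this much
             or less are skipped, since noisy ratings give no clear preference.
    """
    pairs = []
    for (t_a, s_a), (t_b, s_b) in combinations(rated, 2):
        if abs(s_a - s_b) <= min_gap:
            continue
        pairs.append((t_a, t_b) if s_a > s_b else (t_b, t_a))
    return pairs


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical ratings for three translations of the same source sentence.
pairs = build_preference_pairs([("hyp_a", 90.0), ("hyp_b", 60.0), ("hyp_c", 60.0)])

# The loss is small when the model scores the preferred translation higher,
# and large when the ordering is reversed.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
```

Only the relative order of the two scores enters the loss, so the metric is never asked to reproduce an absolute human rating. That is the property the abstract credits for robustness to rating noise.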

Code Implementations: 1 repo
