CLDec 21, 2025

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

arXiv:2512.18906v11 citationsh-index: 17
Originality Highly original
AI Analysis

This addresses the need for more interpretable and robust machine translation evaluation metrics for researchers and practitioners, though it builds incrementally on existing generative approaches.

The paper tackles the problem of black-box machine translation evaluation metrics by introducing Remedy-R, a reasoning-driven generative metric that produces interpretable step-by-step analyses without requiring error annotations. The result is a system that remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, exhibits strong robustness on out-of-distribution tests, and enables a practical evaluate-revise pipeline that improves translation quality across diverse models.

Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes