CL · Jan 12

Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

arXiv:2601.07338v1 · 3 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of evaluating machine translation in linguistically complex domains such as social media and literature, where current metrics fail. It is an incremental improvement focused on a specific evaluation bottleneck.

The paper tackles the inaccuracy of machine translation evaluation metrics on non-literal translations by creating the MENT dataset and proposing the RATE framework, which improves meta scores by at least 3.2 points over existing methods.

Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are now applied to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, which makes MT metrics inaccurate. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, which improves the meta score by at least 3.2 points over current metrics. Further experiments demonstrate the robustness of RATE in general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.
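To make the idea of meta-evaluation concrete, here is a minimal sketch of how a metric's agreement with human scores might be measured. It uses Kendall's tau-a, a common rank-correlation choice, as an assumption; the abstract does not define how the paper's "meta score" is computed, so this is not the authors' formula. The function name `kendall_tau_a` and the toy scores are hypothetical and are not taken from MENT.

```python
from itertools import combinations


def kendall_tau_a(human_scores, metric_scores):
    """Rank agreement between human scores and metric scores (Kendall's tau-a).

    Assumption: this is an illustrative meta-evaluation statistic, not the
    meta score defined in the paper.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")
    n = len(human_scores)
    if n < 2:
        raise ValueError("need at least two scored translations")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both scorers order the pair the same way.
        product = (human_scores[i] - human_scores[j]) * (
            metric_scores[i] - metric_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Ties (product == 0) count toward neither total.

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data (hypothetical): five translations with human quality ratings.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # ranks translations exactly as humans do
metric_b = [0.5, 0.1, 0.4, 0.2, 0.3]  # frequently disagrees with humans

tau_a = kendall_tau_a(human, metric_a)
tau_b = kendall_tau_a(human, metric_b)
print(f"metric_a tau: {tau_a:.2f}")
print(f"metric_b tau: {tau_b:.2f}")
```

A metric that tracks human judgments scores near 1, while one that misranks translations drifts toward 0 or below. Failure on non-literal translations would show up as a low value on such a statistic.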

