CL · Jul 3, 2024

Evaluating Automatic Metrics with Incremental Machine Translation Systems

arXiv:2407.03277v2 · 23 citations · h-index: 49 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides a valuable testbed for researchers in machine translation to assess metric performance, though it is incremental as it builds on prior findings with a larger dataset.

The authors address the evaluation of automatic machine translation metrics by introducing a dataset of commercial translations collected over six years. They find that neural metrics generally outperform non-neural ones, and the dataset's size allows a closer look at how metric reliability changes with translation quality.

We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study not only confirms several prior findings, such as the advantage of neural metrics over non-neural ones, but also explores the debated issue of how MT quality affects metric reliability--an investigation that smaller datasets in previous research could not sufficiently explore. Overall, our research demonstrates the dataset's value as a testbed for metric evaluation. We release our code at https://github.com/gjwubyron/Evo
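The evaluation idea in the abstract (a metric is rewarded when it prefers more recent translations) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name `newer_preference_accuracy`, the all-pairs comparison, and the treatment of ties as non-preferences are assumptions made for illustration, and the released repository may pair or aggregate translations differently.

```python
from itertools import combinations
from typing import Sequence


def newer_preference_accuracy(scores: Sequence[float]) -> float:
    """Fraction of time-ordered pairs in which the newer output scores higher.

    `scores` holds one metric's scores for the same source text, ordered
    from the oldest system snapshot to the newest (hypothetical layout).
    Ties count as "not preferred", which is an assumption of this sketch.
    """
    pairs = list(combinations(range(len(scores)), 2))  # all (older, newer) index pairs
    if not pairs:
        raise ValueError("need at least two time-ordered scores")
    wins = sum(1 for older, newer in pairs if scores[newer] > scores[older])
    return wins / len(pairs)


# Illustrative, made-up scores for four weekly snapshots of one system.
example_scores = [0.60, 0.58, 0.65, 0.70]
accuracy = newer_preference_accuracy(example_scores)
print(f"{accuracy:.3f}")
```

Under the assumption that the commercial system improves over time, a more reliable metric should yield a value closer to 1.0 in this sketch, while a value near 0.5 would suggest the metric cannot tell older from newer output.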

Code Implementations: 1 repo
