CL AI May 24, 2023

MuLER: Detailed and Scalable Reference-based Evaluation

arXiv:2305.14991v2131 citations
Originality: Incremental advance
AI Analysis

MuLER provides a scalable tool for detailed error analysis in text generation tasks such as machine translation and summarization, enabling targeted improvements. The advance is incremental, since it builds on existing metrics.

The authors address coarse-grained evaluation in text generation by proposing MuLER, a method that turns any reference-based metric into a fine-grained analysis tool by quantifying how much the metric penalizes specific error types. Applied to machine translation, it shows that nouns and verbs are among the hardest parts of speech to translate despite being among the most frequent.

We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability in MT evaluation and in other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.
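The idea of measuring how much a metric penalizes one error type can be illustrated with a small masking sketch. This is not the paper's implementation: the toy metric `unigram_f1`, the helper names `mask` and `feature_penalty`, the placeholder tokens, and the best-case/worst-case normalization are all assumptions made for illustration, and the POS tags are supplied by hand rather than by a tagger.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy reference-based metric (assumed): unigram-overlap F1 of two token lists."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mask(tokens, tags, feature, placeholder):
    """Replace every token whose tag equals `feature` with `placeholder`."""
    return [placeholder if tag == feature else tok
            for tok, tag in zip(tokens, tags)]

def feature_penalty(hyp, hyp_tags, ref, ref_tags, feature, metric=unigram_f1):
    """Hypothetical penalty in [0, 1] attributed to one feature (e.g. a POS tag).

    best:  feature tokens share one placeholder, as if all were translated correctly.
    worst: feature tokens get different placeholders, as if all were wrong.
    The penalty is where the real score falls between those two extremes.
    """
    base = metric(hyp, ref)
    best = metric(mask(hyp, hyp_tags, feature, "<F>"),
                  mask(ref, ref_tags, feature, "<F>"))
    worst = metric(mask(hyp, hyp_tags, feature, "<HYP>"),
                   mask(ref, ref_tags, feature, "<REF>"))
    if best == worst:  # the feature never affects the score
        return 0.0
    return (best - base) / (best - worst)

# Hand-tagged toy example: one of the two nouns is mistranslated.
ref = "the cat sat on the mat".split()
hyp = "the dog sat on the mat".split()
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]

print(round(feature_penalty(hyp, tags, ref, tags, "NOUN"), 3))  # 0.5
print(round(feature_penalty(hyp, tags, ref, tags, "VERB"), 3))  # 0.0
```

In this toy case the noun penalty is 0.5 because one of two nouns is wrong, while the verb penalty is 0 because the verb matches the reference. Any other reference-based metric with the same `(hyp, ref)` signature could be passed as `metric`.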

