Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation
This work addresses the problem of evaluating discourse-related translation quality, which matters to researchers and developers in machine translation. Its contribution is incremental in that it focuses on benchmarking rather than proposing a new method.
The authors tackle the sparse evidence that context-aware machine translation systems improve translation quality, especially for discourse phenomena. They introduce the first benchmark datasets for evaluating four main discourse phenomena and find that existing models do not improve consistently across languages and phenomena.
Although a growing number of machine translation (MT) systems incorporate contextual information, the evidence for translation quality improvement is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks, and evaluate several baseline MT systems on the curated datasets. Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.