On the Evaluation of Machine Translation for Terminology Consistency
This work addresses the need for reliable evaluation tools in professional translation pipelines, particularly for domain adaptation. Its contribution is incremental, as it builds on existing efforts to integrate terminologies into machine translation.
The authors tackle the problem of evaluating whether machine translation systems adhere to domain-specific terminologies. They propose new metrics and validate them through studies on the COVID-19 domain across 5 languages, including human evaluation.
As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios, and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with regard to a domain terminology. We perform studies on the COVID-19 domain over 5 languages and also conduct terminology-targeted human evaluation. We open-source the code for computing all proposed metrics: https://github.com/mahfuzibnalam/terminology_evaluation
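To make the idea of terminology consistency concrete, the sketch below computes a naive exact-match rate: of all terminology entries whose source term occurs in a source sentence, the fraction for which an accepted target term occurs in the MT output. This is an illustrative assumption and not necessarily one of the metrics proposed in this work. The function name `term_match_rate`, the dictionary layout of the terminology, and the case-insensitive substring matching are all hypothetical simplifications; a realistic metric would likely need tokenization and handling of inflected forms.

```python
from typing import Dict, List


def term_match_rate(
    sources: List[str],
    hypotheses: List[str],
    terminology: Dict[str, List[str]],
) -> float:
    """Fraction of triggered terminology entries satisfied by the MT output.

    An entry is "triggered" when its source term occurs in the source
    sentence, and "satisfied" when any of its accepted target terms occurs
    in the corresponding hypothesis. Matching is a case-insensitive
    substring test, which is a deliberate simplification.
    """
    matched = 0
    total = 0
    for src, hyp in zip(sources, hypotheses):
        src_lc = src.lower()
        hyp_lc = hyp.lower()
        for src_term, tgt_terms in terminology.items():
            if src_term.lower() in src_lc:
                total += 1
                if any(t.lower() in hyp_lc for t in tgt_terms):
                    matched += 1
    # With no triggered entries there is nothing to measure; return 0.0.
    return matched / total if total else 0.0


if __name__ == "__main__":
    # Toy English-to-French example (hypothetical data).
    sources = [
        "The vaccine reduces the risk of infection.",
        "Wear a face mask indoors.",
    ]
    hypotheses = [
        "Le vaccin réduit le risque d'infection.",
        "Portez un couvre-visage à l'intérieur.",
    ]
    terminology = {
        "vaccine": ["vaccin"],
        "infection": ["infection"],
        "face mask": ["masque"],
    }
    print(term_match_rate(sources, hypotheses, terminology))
```

In this toy example three entries are triggered and two are satisfied, so the rate is 2/3: the second hypothesis uses "couvre-visage" where the terminology requires "masque". A score like this only checks that the required term is present, not that it is placed or inflected correctly, which is one reason terminology-targeted human evaluation remains useful.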