CL Apr 21, 2021

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs

Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann, Tom Kocmi

arXiv:2104.10408v132.7803 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of reliable large-scale human evaluation for machine translation researchers and practitioners. The advance is incremental, since it builds on existing document-level evaluation efforts.

The study examines how user interfaces affect human evaluation of machine translation by comparing two document-level evaluation methods from WMT campaigns. It finds that a document-centric approach improves assessment quality and inter-annotator agreement but increases annotation time.

Recent studies emphasize the need for document context in human evaluation of machine translations, but little research has been done on the impact of user interfaces on annotator productivity and the reliability of assessments. In this work, we compare human assessment data from the last two WMT evaluation campaigns collected via two different methods for document-level evaluation. Our analysis shows that a document-centric approach to evaluation, where the annotator is presented with the entire document context on a screen, leads to higher-quality segment- and document-level assessments. It improves the correlation between segment and document scores and increases inter-annotator agreement for document scores, but is considerably more time-consuming for annotators.
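
The abstract reports a correlation between segment and document scores without stating here how it is computed. As a rough illustration only, the sketch below assumes each document has a list of segment scores and one document score on a 0-100 scale, averages the segment scores per document, and takes the Pearson correlation with the document scores. The function names, the averaging step, the choice of Pearson, and the toy numbers are all assumptions for illustration, not the authors' procedure or data.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def segment_document_correlation(documents):
    """Correlate per-document mean segment scores with document scores.

    `documents` is a list of (segment_scores, document_score) pairs.
    This layout is hypothetical, chosen only for the example.
    """
    segment_means = [mean(segments) for segments, _ in documents]
    document_scores = [score for _, score in documents]
    return pearson(segment_means, document_scores)


# Made-up toy annotations: (segment scores, document score) per document.
toy_documents = [
    ([70, 80, 90], 82),
    ([40, 50, 45], 48),
    ([95, 90, 85], 88),
    ([60, 65, 55], 70),
]

r = segment_document_correlation(toy_documents)
print(f"segment/document correlation: {r:.3f}")
```

A value near 1 would indicate that annotators' document scores track their averaged segment scores closely; comparing this value across two interfaces is one plausible way to read the claim in the abstract.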
