Fine-grained human evaluation of neural versus phrase-based machine translation
This paper provides a detailed human evaluation for the machine translation community and shows that the neural system outperforms the phrase-based ones. The contribution is incremental, since it builds on existing comparison frameworks.
The study compares three approaches to statistical machine translation (pure phrase-based, factored phrase-based, and neural) through fine-grained manual error annotation with MQM-compliant error types. The best-performing system (neural) reduces the errors of the worst system (phrase-based) by 54%.
We compare three approaches to statistical machine translation (pure phrase-based, factored phrase-based and neural) by performing a fine-grained manual evaluation via error annotation of the systems' outputs. The error types in our annotation are compliant with the multidimensional quality metrics (MQM), and the annotation is performed by two annotators. Inter-annotator agreement is high for such a task, and results show that the best performing system (neural) reduces the errors produced by the worst system (phrase-based) by 54%.
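The two quantities reported above, a relative error reduction and inter-annotator agreement, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the error counts and labels are made up, the function names are hypothetical, and Cohen's kappa is used as one common agreement statistic for two annotators, since the abstract does not say which measure the authors computed.

```python
from collections import Counter


def relative_error_reduction(baseline_errors, system_errors):
    """Fraction of the baseline system's errors that the other system avoids."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return (baseline_errors - system_errors) / baseline_errors


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical counts chosen only so that the reduction comes out at 54%.
reduction = relative_error_reduction(baseline_errors=1000, system_errors=460)

# Hypothetical token-level labels from two annotators.
annotator_1 = ["fluency", "accuracy", "none", "none"]
annotator_2 = ["fluency", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Under this reading, a 54% reduction means that for every 100 errors annotated in the worst system's output, about 46 are annotated in the best system's output. Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (such as "no error") dominates.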