Linguistic evaluation of German-English Machine Translation using a Test Suite
This work provides a detailed linguistic evaluation for MT researchers and developers, but its contribution is incremental because it builds on existing test suite methods.
The researchers evaluated German-to-English machine translation systems from WMT19 using a grammatical test suite covering 107 phenomena, finding that systems still incorrectly translate one out of four test items on average, with particularly low performance in areas like idioms and verb valency.
We present the results of applying a grammatical test suite for German$\rightarrow$English MT to the systems submitted at WMT19, with a detailed analysis of 107 phenomena organized in 14 categories. On average, the systems still mistranslate one out of four test items. Low performance is observed for idioms, modals, pseudo-clefts, multi-word expressions and verb valency. Compared to last year, there has been an improvement in function words, non-verbal agreement and punctuation. More detailed conclusions about particular systems and phenomena are also presented.
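To make figures such as "one out of four test items" concrete, the following is a minimal sketch of how pass/fail judgements on test items could be aggregated into per-category and overall accuracy. It is not the authors' evaluation code: the function names, the `(category, passed)` record layout and the toy data are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def category_accuracy(results):
    """Return {category: share of passed items} for (category, passed) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            correct[category] += 1
    return {c: correct[c] / totals[c] for c in totals}


def overall_accuracy(results):
    """Return the share of passed items over all test items (micro-average)."""
    results = list(results)
    return sum(1 for _, passed in results if passed) / len(results)


# Hypothetical toy judgements, NOT data from the paper.
toy_results = [
    ("idiom", False), ("idiom", True),
    ("punctuation", True), ("punctuation", True),
]

per_category = category_accuracy(toy_results)
overall = overall_accuracy(toy_results)
```

In this toy input one of the four items fails, so the overall accuracy is 0.75, while the per-category view shows that the failure is concentrated in a single category. That difference between an aggregate score and a per-phenomenon breakdown is the kind of detail a test suite is meant to expose.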