Fine-grained evaluation of Quality Estimation for Machine translation based on a linguistically-motivated Test Suite
This work addresses the need for more informative evaluation methods in machine translation Quality Estimation (QE), although its contribution is incremental because it builds on existing QE frameworks.
The paper tackles the problem of evaluating Quality Estimation (QE) systems for machine translation by introducing a linguistically-motivated Test Suite with 14 error categories. A detailed performance analysis across these categories shows that the five QE systems examined perform differently on different phenomena, which the authors take as confirmation of the suite's utility.
We present an alternative method of evaluating Quality Estimation systems, which is based on a linguistically-motivated Test Suite. We create a test set covering 14 linguistic error categories and, for each category, we gather a set of samples containing both correct and erroneous translations. We then measure the performance of 5 Quality Estimation systems by checking their ability to distinguish between the correct and the erroneous translations. The detailed per-category results are much more informative about the abilities of each system. The fact that different Quality Estimation systems perform differently on various phenomena confirms the usefulness of the Test Suite.
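The evaluation procedure described above can be illustrated with a minimal sketch. It assumes that a QE system is exposed as a function returning a numeric quality score (higher meaning better), and that a test item counts as passed when the correct translation receives a strictly higher score than the erroneous one. Neither the decision criterion nor the names `SuiteItem`, `accuracy_per_category`, and `score` come from the paper; they are hypothetical and serve only to make the idea concrete.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class SuiteItem(NamedTuple):
    """One test-suite sample (hypothetical structure, not taken from the paper)."""
    category: str    # linguistic error category the sample belongs to
    source: str      # source sentence
    correct: str     # correct translation
    erroneous: str   # translation containing the targeted error


def accuracy_per_category(
    items: Iterable[SuiteItem],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return, per category, the share of items that the QE system gets right.

    Assumption: `score(source, translation)` returns a quality score where
    higher is better, and an item is passed only if the correct translation
    is scored strictly higher than the erroneous one.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.category] += 1
        good = score(item.source, item.correct)
        bad = score(item.source, item.erroneous)
        if good > bad:
            passed[item.category] += 1
    return {category: passed[category] / total[category] for category in total}
```

Running this function once per QE system yields one accuracy value per category and system, which is the kind of fine-grained comparison the abstract argues for. Treating ties as failures is a design choice of this sketch; a different tie rule would change the numbers.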