Detecting Machine-Translated Paragraphs by Matching Similar Words
This work addresses the problem of misunderstandings caused by unnatural translations, which affects users and systems that rely on machine-translated text, and it represents a strong incremental improvement over existing detection methods.
The paper tackles the detection of machine-translated paragraphs with a coherence-based method that matches similar words throughout a paragraph, achieving accuracies of 87.0% on English, 89.2% on Dutch, and 97.9% on Japanese and outperforming previous methods.
Machine-translated text plays an important role in modern life by smoothing communication among communities that use different languages. However, unnatural translations may lead to misunderstanding, so a detector is needed to avoid such mistakes. One previous method measured the naturalness of continuous words using an N-gram language model; another matched noncontinuous words across sentences but ignored such words within an individual sentence. We have developed a method that matches similar words throughout a paragraph and estimates paragraph-level coherence in order to identify machine-translated text. An experiment on 2000 human-generated English paragraphs and 2000 English paragraphs machine-translated from German shows that the coherence-based method achieves high performance (accuracy = 87.0%; equal error rate = 13.0%). It substantially outperforms previous methods (best accuracy = 72.4%; equal error rate = 29.7%). Similar experiments on Dutch and Japanese obtain accuracies of 89.2% and 97.9%, respectively. The results demonstrate that the proposed method remains effective across languages with different resource levels.
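The idea of matching similar words throughout a paragraph can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the matching strategy, or how matches become a coherence estimate, so everything below is an assumption made for illustration: the toy two-dimensional vectors in TOY_VECTORS, the use of cosine similarity, and the hypothetical function paragraph_coherence, which averages each word's best match against every other word in the paragraph. It is not the authors' implementation.

```python
import math

# Toy 2-d vectors invented for this illustration; a real system would
# load pretrained word embeddings instead (assumption, not from the paper).
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "kitten": (0.9, 0.1),
    "pet": (0.8, 0.2),
    "bank": (0.0, 1.0),
    "tax": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paragraph_coherence(words, vectors=TOY_VECTORS):
    """Hypothetical coherence score: mean best-match similarity per word.

    Each word is compared with every other word in the paragraph,
    regardless of sentence boundaries, and its highest similarity is kept.
    Words without a vector are skipped.
    """
    known = [w for w in words if w in vectors]
    if len(known) < 2:
        return 0.0  # nothing to match against
    best_matches = []
    for i, word in enumerate(known):
        best = max(
            cosine(vectors[word], vectors[other])
            for j, other in enumerate(known)
            if j != i
        )
        best_matches.append(best)
    return sum(best_matches) / len(best_matches)
```

With these toy vectors, paragraph_coherence(["cat", "kitten", "pet"]) scores higher than paragraph_coherence(["cat", "tax", "pet"]), because "tax" has no close match among the other words. In a detector, a score of this kind would presumably serve as a feature for a classifier that separates human-generated from machine-translated paragraphs; how the features are actually combined is not stated in the abstract.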