CL, Apr 24, 2019

Detecting Machine-Translated Paragraphs by Matching Similar Words

arXiv:1904.10641v1, 8 citations
Originality: Incremental advance
AI Analysis

This work addresses misunderstandings caused by unnatural translations, a problem for users and systems that rely on machine-translated text. It represents a strong incremental improvement over existing detection methods.

The paper tackles the detection of machine-translated paragraphs with a coherence-based method that matches similar words throughout a paragraph. It reports accuracies of 87.0% on English, 89.2% on Dutch, and 97.9% on Japanese, outperforming previous methods.

Machine-translated text plays an important role in modern life by smoothing communication among communities that use different languages. However, unnatural translation may lead to misunderstanding, so a detector is needed to avoid such mistakes. One previous method measured the naturalness of continuous words using an N-gram language model. Another matched noncontinuous words across sentences, but it ignored such words within an individual sentence. We have developed a method that matches similar words throughout the paragraph and estimates paragraph-level coherence in order to identify machine-translated text. Experiments on 2000 human-generated English paragraphs and 2000 English paragraphs machine-translated from German show that the coherence-based method achieves high performance (accuracy = 87.0%; equal error rate = 13.0%). It clearly outperforms previous methods (best accuracy = 72.4%; equal error rate = 29.7%). Similar experiments on Dutch and Japanese obtain 89.2% and 97.9% accuracy, respectively. The results demonstrate that the proposed method holds up across languages with different resource levels.
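
To make the idea of matching similar words throughout a paragraph more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the similarity measure, the features, or the classifier, so everything below is an assumption made for illustration. The toy embedding table `TOY_VECTORS`, the use of cosine similarity, and the helper names `best_match_similarities` and `coherence_features` are all hypothetical. The sketch pairs every word with its closest other word anywhere in the paragraph (within or across sentences) and summarizes those similarities as simple coherence features.

```python
import math
from statistics import mean, pvariance

# Hypothetical toy embeddings (an assumption for this sketch);
# a real system would load pretrained word vectors instead.
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "kitten": (0.9, 0.1),
    "dog": (0.8, 0.3),
    "bank": (0.0, 1.0),
    "loan": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match_similarities(words, vectors):
    """For each word with a vector, return the similarity to its
    closest other word anywhere in the paragraph."""
    known = [w for w in words if w in vectors]
    sims = []
    for i, word in enumerate(known):
        others = [
            cosine(vectors[word], vectors[other])
            for j, other in enumerate(known)
            if j != i
        ]
        if others:
            sims.append(max(others))
    return sims


def coherence_features(paragraph, vectors=TOY_VECTORS):
    """Summarize best-match similarities as paragraph-level features.
    Tokenization is a plain whitespace split, with no punctuation handling."""
    sims = best_match_similarities(paragraph.lower().split(), vectors)
    if not sims:
        return {"mean": 0.0, "variance": 0.0}
    return {"mean": mean(sims), "variance": pvariance(sims)}


if __name__ == "__main__":
    print(coherence_features("cat kitten dog"))
    print(coherence_features("cat bank"))
```

In a full detector, features of this kind would presumably be fed to a trained classifier that separates human-generated from machine-translated paragraphs; that step is omitted here because the abstract gives no details about it.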

