CL Oct 15, 2019

Detecting Machine-Translated Text using Back Translation

Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, Shinsaku Kiyomoto

arXiv:1910.06558v1 30.0995 citations

Originality: Incremental advance

AI Analysis

The paper addresses the detection of machine-translated text used for malicious purposes such as plagiarism and fake reviews. The contribution is incremental because it builds on existing back-translation techniques.

The paper tackles the detection of machine-translated text, which is difficult when the translation has the same meaning as a human-written text. It proposes using the similarity between a text and its back-translation as classification features. On French sentences the method reaches 75.0% accuracy and F-score, against 62.8% accuracy for the best existing method. On back-translated text it reaches 83.4% accuracy, against 66.7% for the best previous method, and the authors report similar results for Japanese.

Machine-translated text plays a crucial role in communication between people who use different languages. However, adversaries can use such text for malicious purposes such as plagiarism and fake reviews. Existing methods detect machine-translated text using only the text's intrinsic content, which makes them unsuitable for distinguishing machine-translated from human-written texts that have the same meaning. We propose a method that extracts features for distinguishing machine from human text based on the similarity between the intrinsic text and its back-translation. In an evaluation on detecting translated sentences in French, our method achieves 75.0% for both accuracy and F-score. It outperforms the existing methods, whose best accuracy is 62.8% and best F-score is 62.7%. The proposed method detects back-translated text even more effectively, with an accuracy of 83.4%, which is higher than the best previous accuracy of 66.7%. We also achieve similar results in F-score and in the corresponding experiments on Japanese. Moreover, we show that our detector can recognize both machine-translated and machine-back-translated texts without knowing the language that was used to generate these machine texts. This demonstrates the persistence of our method in various applications in both low- and rich-resource languages.
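The core idea in the abstract, comparing a text with its own back-translation, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the `back_translate` and `similarity_features` helpers, the three similarity measures, and the toy dictionary "translators" are hypothetical and are not taken from the paper, which does not specify its features or translation system in the abstract. In practice the two translator callables would wrap a real machine translation system, and the resulting feature vectors would be passed to a classifier of one's choice.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# A translator is modelled as any function from text to text (assumption:
# a real system would call a machine translation service here).
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back into the original one."""
    return from_pivot(to_pivot(text))


def similarity_features(text: str, back: str) -> List[float]:
    """Illustrative similarity features between a text and its back-translation.

    These three measures are stand-ins chosen for simplicity, not the
    features used in the paper.
    """
    a = text.lower().split()
    b = back.lower().split()
    # Word-level sequence similarity in [0, 1].
    seq_ratio = SequenceMatcher(None, a, b).ratio()
    # Jaccard overlap of the two vocabularies.
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    # Ratio of the shorter length to the longer length.
    length_ratio = min(len(a), len(b)) / max(len(a), len(b)) if a and b else 0.0
    return [seq_ratio, jaccard, length_ratio]


# Toy word-for-word "translators" so the sketch runs without any external service.
EN_TO_FR = {"the": "le", "cat": "chat", "sat": "assis", "on": "sur", "mat": "tapis"}
FR_TO_EN = {"le": "the", "chat": "cat", "assis": "sat", "sur": "on", "tapis": "rug"}


def toy_en_to_fr(text: str) -> str:
    return " ".join(EN_TO_FR.get(w, w) for w in text.split())


def toy_fr_to_en(text: str) -> str:
    return " ".join(FR_TO_EN.get(w, w) for w in text.split())


sentence = "the cat sat on the mat"
round_trip = back_translate(sentence, toy_en_to_fr, toy_fr_to_en)
features = similarity_features(sentence, round_trip)
print(round_trip)
print(features)
```

In this toy run the round trip changes one word ("mat" comes back as "rug"), so the similarity features fall below 1. A detector along the lines described in the abstract would learn from many such feature vectors, labelled as human-written or machine-translated, rather than rely on a fixed threshold.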
