CL, Jun 24, 2021

On the Influence of Machine Translation on Language Origin Obfuscation

arXiv:2106.12830v1
Originality: Incremental advance
AI Analysis

This paper addresses privacy and security concerns around multilingual digital content by showing the limits of using machine translation to obfuscate a text's source language.

The paper investigates whether machine translation systems can effectively hide the original language of a text by testing how well the source language can be detected from translated output using basic textual features. Results show that the source language can be reconstructed with high accuracy when a document contains enough translated text; accuracy depends on document size and improves when the set of candidate source languages is restricted.

In the last decade, machine translation has become a popular means of dealing with multilingual digital content. As translation quality improves, using machine translation to obfuscate the source language of a text becomes more attractive. In this paper, we analyze the ability to detect the source language from the translated output of two widely used commercial machine translation systems by utilizing machine-learning algorithms with basic textual features like n-grams. Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text. In addition, we analyze how the document size influences the performance of the prediction, as well as how limiting the set of possible source languages improves the classification accuracy.
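To make the setup in the abstract concrete, here is a minimal sketch of source-language classification from translated text using character n-gram counts. It is an illustration under stated assumptions, not the authors' pipeline: the abstract only mentions "basic textual features like n-grams", so the choice of character n-grams up to length 3, the naive Bayes classifier, and the names `char_ngrams`, `NgramNaiveBayes`, and `candidates` are all hypothetical. The training sentences are made-up toy strings, not real machine translation output.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in lowercased text."""
    text = text.lower()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # additive smoothing constant
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text, candidates=None):
        """Return the most likely label, optionally restricted to candidates."""
        labels = [l for l in (candidates or self.docs) if l in self.docs]
        if not labels:
            raise ValueError("no known candidate labels")
        grams = char_ngrams(text)
        n_docs = sum(self.docs[l] for l in labels)
        size = len(self.vocab)
        scores = {}
        for label in labels:
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / n_docs)  # prior over candidates
            for gram, count in grams.items():
                if gram in self.vocab:  # ignore n-grams never seen in training
                    p = (self.counts[label][gram] + self.alpha) / (total + self.alpha * size)
                    score += count * math.log(p)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy, made-up English sentences standing in for translated output; the labels
# are hypothetical source languages. Real experiments need far more text.
texts = [
    "i have the book yesterday read",
    "he has the house already seen",
    "i have hunger and she has thirst",
    "it makes cold today in the city",
]
labels = ["de", "de", "fr", "fr"]
model = NgramNaiveBayes().fit(texts, labels)

# Restricting the candidate set mirrors the paper's "limited language set" setting.
print(model.predict("she has the letter already written", candidates=["de", "fr"]))
```

Because each n-gram contributes one term to the score, a longer document supplies more evidence per prediction, which is one plausible reading of why the abstract reports a dependence on document size. The `candidates` argument shows the second effect in miniature: with fewer labels to choose from, there are fewer ways to be wrong.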
