When Does Unsupervised Machine Translation Work?
This work addresses the reliability of unsupervised MT for researchers and practitioners, highlighting critical failure points to guide future work.
The paper investigates the conditions under which unsupervised machine translation succeeds and fails. It finds that performance deteriorates under domain mismatch, when source and target languages use different scripts, and in authentic low-resource settings, and that random word embedding initialization can dramatically affect results.
Despite the reported success of unsupervised machine translation (MT), the field has yet to examine the conditions under which these methods succeed and where they fail. We conduct an extensive empirical evaluation of unsupervised MT using dissimilar language pairs, dissimilar domains, diverse datasets, and authentic low-resource languages. We find that performance rapidly deteriorates when source and target corpora are from different domains, and that random word embedding initialization can dramatically affect downstream translation performance. We additionally find that unsupervised MT performance declines when source and target languages use different scripts, and we observe very poor performance on authentic low-resource language pairs. We advocate for extensive empirical evaluation of unsupervised MT systems to highlight failure points and encourage continued research on the most promising paradigms.