CL · Jun 12, 2021

Machine Translation into Low-resource Language Varieties

arXiv:2106.06797v2717 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of providing NLP solutions for low-resource language varieties, which standard systems often exclude. The advance is incremental because it builds on existing MT frameworks.

The paper tackles the problem of adapting machine translation systems to generate low-resource language varieties without parallel data. It reports significant improvements over competitive baselines in experiments on Ukrainian, Belarusian, Nynorsk, and four Arabic dialects.

State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source--variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English--Russian MT system to generate Ukrainian and Belarusian, an English--Norwegian Bokmål system to generate Nynorsk, and an English--Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.

