CL AI May 23, 2023

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos

arXiv:2305.14263v221.5135 citations Has Code

Originality: Highly original

AI Analysis

This paper addresses inaccurate language identification for low-resource languages, a problem that limits how well NLP tools can be applied to them. The contribution is incremental in that it builds on existing methods, adding a novel misprediction-resolution approach.

The paper tackles the bottleneck of language identification for most of the world's 7000 languages by compiling a new corpus, MCS-350, and proposing a hierarchical model, LIMIt, which reduces error by 55% on their dataset and 40% on FLORES-200.
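The percentages quoted above are relative error reductions. As a quick check, the short sketch below recomputes them from the error values given in the paper's abstract (0.71 to 0.32, and 0.23 to 0.14). The function name `relative_error_reduction` is only an illustrative helper and does not come from the paper or its code.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


# Error values as reported in the abstract.
mcs350 = relative_error_reduction(0.71, 0.32)
flores = relative_error_reduction(0.23, 0.14)

print(f"MCS-350:    {mcs350:.1%}")   # about 55%
print(f"FLORES-200: {flores:.1%}")   # about 39%, reported as 40%
```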

Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIt, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
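The idea of resolving systematic mispredictions with a second, smaller model can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `min_count` threshold, the stub base predictor, and the keyword-based resolver are all assumptions made for illustration. The sketch assumes a base identifier that cannot tell two languages apart (here it labels all Devanagari text as `hin`), detects that confusion on a small labelled development set, and routes such predictions to a cluster-specific resolver while leaving the base model unchanged.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Set

Predictor = Callable[[str], str]


def confusion_clusters(gold: List[str], pred: List[str],
                       min_count: int = 2) -> Dict[str, Set[str]]:
    """Map each predicted label to the gold labels it repeatedly absorbs."""
    counts = Counter((p, g) for g, p in zip(gold, pred) if g != p)
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for (p, g), n in counts.items():
        if n >= min_count:  # keep only recurring (systematic) confusions
            clusters[p].update({p, g})
    return dict(clusters)


def hierarchical_predict(text: str, base: Predictor,
                         resolvers: Dict[str, Predictor]) -> str:
    """Run the base model, then a resolver if its label is a confused one."""
    label = base(text)
    resolver = resolvers.get(label)
    return resolver(text) if resolver else label


# Hypothetical base model: labels any Devanagari text as Hindi.
def base_model(text: str) -> str:
    is_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    return "hin" if is_devanagari else "eng"


# Hypothetical resolver for the {hin, mar} cluster, keyed on a Marathi copula.
def hin_mar_resolver(text: str) -> str:
    return "mar" if "आहे" in text else "hin"


dev_texts = ["यह किताब है", "हे पुस्तक आहे", "ते घर आहे", "this is a book"]
dev_gold = ["hin", "mar", "mar", "eng"]
dev_pred = [base_model(t) for t in dev_texts]

clusters = confusion_clusters(dev_gold, dev_pred)
resolvers = {"hin": hin_mar_resolver} if "hin" in clusters else {}

final = [hierarchical_predict(t, base_model, resolvers) for t in dev_texts]
print(clusters)
print(final)
```

In a realistic setting the resolver would be a small trained classifier restricted to the labels in one confusion cluster; a keyword rule is used here only to keep the example self-contained.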
