CL, AI · May 23, 2023

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

arXiv:2305.14263v2 · 135 citations
Originality: Highly original
AI Analysis

This paper addresses inaccurate language identification for low-resource languages, which limits how well NLP tools can be applied to them. The contribution is incremental: it builds on existing methods and adds a novel misprediction-resolution approach.

The paper tackles the bottleneck of language identification for most of the world's 7000 languages by compiling a new corpus, MCS-350, and proposing a hierarchical model, LIMIt, which reduces error by 55% on the authors' children's stories dataset and by 40% on FLORES-200.

Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIt, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
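The abstract describes the hierarchical idea only at a high level, so the following is a minimal sketch of one way a misprediction-resolution step could be wired up, not the authors' implementation. It assumes a frozen base classifier, groups languages that the base model systematically confuses into clusters, and routes any prediction that lands in a cluster to a small cluster-specific classifier. The function names, the union-find grouping, the `min_count` threshold, and the toy transliterated inputs are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, FrozenSet, List

Classifier = Callable[[str], str]


def build_confusion_clusters(
    gold: List[str], pred: List[str], min_count: int = 2
) -> Dict[str, FrozenSet[str]]:
    """Map each label to the set of languages the base model confuses it with.

    Assumption: a (gold, predicted) pair counts as a systematic confusion
    once it occurs at least `min_count` times on held-out data.
    """
    confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (g, p), n in confusions.items():
        if n >= min_count:
            parent[find(g)] = find(p)

    groups: Dict[str, set] = {}
    for lang in list(parent):
        groups.setdefault(find(lang), set()).add(lang)
    return {lang: frozenset(m) for m in groups.values() for lang in m}


def resolve(
    text: str,
    base_model: Classifier,
    clusters: Dict[str, FrozenSet[str]],
    cluster_models: Dict[FrozenSet[str], Classifier],
) -> str:
    """Keep the base prediction unless it falls in a known confusion cluster."""
    label = base_model(text)
    cluster = clusters.get(label)
    if cluster is None or cluster not in cluster_models:
        return label
    return cluster_models[cluster](text)


# Toy setup: a base model that labels everything as Hindi ("hin"),
# so every Marathi ("mar") input is mispredicted as "hin".
base = lambda text: "hin"
gold = ["hin", "mar", "mar", "hin"]
pred = [base("") for _ in gold]
clusters = build_confusion_clusters(gold, pred)

# Hypothetical cluster-level classifier: a keyword rule on transliterated text
# ("ahe" is the Marathi copula; Hindi uses "hai").
cluster_models = {
    frozenset({"hin", "mar"}): lambda t: "mar" if "ahe" in t.split() else "hin"
}

print(resolve("to ghari ahe", base, clusters, cluster_models))
print(resolve("vah ghar par hai", base, clusters, cluster_models))
```

In this sketch the base model is never retrained. A language it does not cover ("mar" in the toy data) becomes reachable because it shares a cluster with the label the base model wrongly assigns to it, which is the sense in which coverage is expanded from misprediction patterns alone.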

Code Implementations: 1 repo