CL | Apr 30, 2025

Improving Informally Romanized Language Identification

arXiv:2504.21540v3 | 1 citation | h-index: 20 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific problem in multilingual natural language processing: language identification (LID) for romanized text, particularly in Indic languages. It represents an incremental improvement over existing methods.

The paper tackles language identification for informally romanized text, where high spelling variability makes languages such as Hindi and Urdu easy to confuse. It improves accuracy by synthesizing training sets that incorporate natural spelling variation, and reports new state-of-the-art performance: test F1 rises from 74.7% to 85.4% using synthetic data alone (a gain of 10.7 points), and to 88.2% when harvested text is added.

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts - Hindi and Urdu, for example - highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
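The general idea in the abstract (expand a training set with spelling variants, then train a linear classifier on character features) can be illustrated with a toy sketch. This is not the authors' pipeline: the respelling rules, the seed lexicon, the language codes, and every function name below are hypothetical, and multinomial naive Bayes (a linear model in log space) is swapped in here only because it fits in a few lines of standard-library Python.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical respelling rules (long form -> short form); purely illustrative.
RESPELLING_RULES = [("aa", "a"), ("ee", "i"), ("oo", "u"), ("th", "t"), ("dh", "d")]

# Toy seed lexicon (an assumption, not data from the paper).
SEED_LEXICON = {
    "hi": ["namaste", "dhanyavaad", "paani", "ghar"],
    "ta": ["vanakkam", "nandri", "thanni", "veedu", "saappaadu"],
}


def spelling_variants(word):
    """Return every respelling obtained by switching each applicable rule on or off."""
    applicable = [rule for rule in RESPELLING_RULES if rule[0] in word]
    variants = set()
    for switches in product([False, True], repeat=len(applicable)):
        respelled = word
        for (long_form, short_form), enabled in zip(applicable, switches):
            if enabled:
                respelled = respelled.replace(long_form, short_form)
        variants.add(respelled)
    return sorted(variants)


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def train(samples):
    """Fit multinomial naive Bayes with add-one smoothing on (text, label) pairs."""
    counts, docs = {}, Counter()
    for text, label in samples:
        counts.setdefault(label, Counter()).update(char_ngrams(text))
        docs[label] += 1
    vocab = set().union(*counts.values())
    model = {}
    for label, label_counts in counts.items():
        total = sum(label_counts.values()) + len(vocab)
        log_prior = math.log(docs[label] / len(samples))
        log_lik = {f: math.log((label_counts[f] + 1) / total) for f in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, text):
    """Return the label with the highest linear (log-space) score."""
    feats = char_ngrams(text)

    def score(label):
        log_prior, log_lik = model[label]
        # N-grams never seen in training are skipped.
        return log_prior + sum(c * log_lik[f] for f, c in feats.items() if f in log_lik)

    return max(model, key=score)


# Build the synthetic training set: every seed word plus its respellings.
training_set = [
    (variant, label)
    for label, words in SEED_LEXICON.items()
    for word in words
    for variant in spelling_variants(word)
]
model = train(training_set)

# "dhanyawad" is a respelling that never appears in the training set.
print(predict(model, "dhanyawad"))
```

Hand-written rules are only a stand-in for variation observed in real romanized text. The sketch shows where synthesized variants enter the training set; it does not reproduce the quality of the variation the paper relies on.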