CL, Jul 16, 2025

ILID: Native Script Language Identification for Indian Languages

arXiv:2507.11832v2, 4.91 citations, h-index: 7, Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of identifying Indian languages in noisy, short, and code-mixed text, which is crucial for NLP applications in India. The contribution is incremental: it applies existing methods to newly created data.

The authors tackled language identification for 23 languages (English and all 22 official Indian languages) by creating a new dataset of 250K sentences and developing baseline models, which they report outperform state-of-the-art pre-trained transformer models.

Language identification is a fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed text. This becomes even harder for the diverse Indian languages, which exhibit lexical and phonetic similarities yet remain distinct. Many Indian languages also share the same script, which makes the task more challenging still. Taking these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages (English and all 22 official Indian languages), each labeled with its language identifier; the data for most languages are newly created. We also develop and release baseline models built with state-of-the-art machine learning approaches and by fine-tuning pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models on the language identification task. The dataset and the code are available at https://yashingle-ai.github.io/ILID/ and on Hugging Face.
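To illustrate why a shared script makes the task harder, here is a minimal sketch that guesses the dominant Unicode script of a sentence. It is not the authors' method: the function name `dominant_script` and the example sentences are assumptions made for illustration. Hindi and Marathi are both written in Devanagari, so a script-only check cannot separate them, which is why a trained classifier is needed.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters of `text`.

    Hypothetical helper for illustration only. It relies on the fact that
    Unicode character names begin with the script name, for example
    "DEVANAGARI LETTER MA" or "LATIN SMALL LETTER A".
    """
    counts = Counter()
    for ch in text:
        # Skip spaces, digits, punctuation, and combining vowel signs.
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


# Illustrative sentences (assumed examples, not taken from the dataset).
samples = {
    "Hindi": "मेरा नाम राम है",
    "Marathi": "माझे नाव राम आहे",
    "Tamil": "என் பெயர் ராம்",
    "English": "My name is Ram",
}

for language, sentence in samples.items():
    print(language, dominant_script(sentence))
```

Script detection can narrow the candidate set (Tamil has its own script here), but languages that share a script still need lexical or character n-gram evidence to tell apart.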
