CLAIMar 27, 2025

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing

arXiv:2503.21670v39 citationsh-index: 4Has CodeEMNLP
Originality Synthesis-oriented
AI Analysis

This addresses a data scarcity problem for researchers and practitioners working on multilingual NLP in code-mixed languages like Hinglish, though it is incremental as it builds on existing annotation efforts.

The authors tackled the lack of high-quality annotated data for Hindi-English code-mixed NLP by creating COMI-LINGUA, a large-scale dataset with 125K+ instances across five tasks, which enabled fine-tuning LLMs to achieve up to 95.25 F1 in NER and 98.77 F1 in MLI.

We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification, Part-Of-Speech Tagging, Named Entity Recognition, and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss' Kappa $\geq$ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts and spans diverse domains, ensuring real-world linguistic coverage. Evaluation reveals that closed-source LLMs significantly outperform traditional tools and open-source models in zero-shot settings. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER. Fine-tuning state-of-the-art LLMs on COMI-LINGUA demonstrates substantial improvements, achieving up to 95.25 F1 in NER, 98.77 F1 in MLI, and competitive MT performance, setting new benchmarks for Hinglish code-mixed text. COMI-LINGUA is publicly available at this URL: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes