CL | Feb 13

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

arXiv:2602.13139v2 | 2 citations | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work targets two weaknesses of language identification tools: low precision on closely related languages and poor separation of natural language from noise. Both matter for building high-quality multilingual datasets, and low-resource language communities stand to benefit most. The contribution is incremental, since it extends the existing OpenLID classifier.

The authors address two problems in language identification tools: low precision on closely related languages and weak separation of natural language from noise. Both contaminate datasets, especially for low-resource languages. They built OpenLID-v3 by extending OpenLID with more training data, merged language variant clusters, and a noise label. They found that ensemble approaches improve precision but substantially reduce coverage for low-resource languages.

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.
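The precision and coverage trade-off reported for ensembles can be illustrated with a minimal sketch. It assumes one simple ensemble rule, keeping a prediction only when all classifiers agree and abstaining otherwise; the abstract does not say which ensemble strategy the authors used. The two lookup-table "models", the sample ids, and the labels are hypothetical toy data, not outputs of OpenLID-v3 or GlotLID.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Classifier = Callable[[str], str]


def agree_or_abstain(text: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return the shared label if every classifier agrees, otherwise None (abstain)."""
    labels = {clf(text) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def precision_and_coverage(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Precision over answered items; coverage is the share of items answered."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    precision = sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    return precision, coverage


# Hypothetical stand-ins for two LID systems: lookup tables over toy sample ids.
MODEL_A = {"s1": "hrv", "s2": "srp", "s3": "bos", "s4": "hrv"}
MODEL_B = {"s1": "hrv", "s2": "srp", "s3": "hrv", "s4": "bos"}
GOLD = {"s1": "hrv", "s2": "srp", "s3": "bos", "s4": "bos"}

model_a: Classifier = MODEL_A.__getitem__
model_b: Classifier = MODEL_B.__getitem__

samples: List[str] = list(GOLD)
gold = [GOLD[s] for s in samples]

# Single model: answers every sample, including the ones it gets wrong.
single_preds = [model_a(s) for s in samples]
single_precision, single_coverage = precision_and_coverage(single_preds, gold)

# Agreement ensemble: abstains wherever the two models disagree.
ensemble_preds = [agree_or_abstain(s, [model_a, model_b]) for s in samples]
ensemble_precision, ensemble_coverage = precision_and_coverage(ensemble_preds, gold)

print(f"single:   precision={single_precision:.2f} coverage={single_coverage:.2f}")
print(f"ensemble: precision={ensemble_precision:.2f} coverage={ensemble_coverage:.2f}")
```

On this toy data the single model answers all four samples with one error, while the ensemble answers only the two samples on which both models agree. Disagreements cluster on the closely related labels, so the ensemble gains precision by discarding exactly the samples that are hardest to label.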
