CL · Sep 23, 2025

Human-Annotated NER Dataset for the Kyrgyz Language

arXiv:2509.19109v1 · 1 citation · h-index: 3 · 2025 10th International Conference on Computer Science and Engineering (UBMK)
Originality: Synthesis-oriented
AI Analysis

This paper provides the first NER dataset for Kyrgyz, addressing a resource gap in low-resource language processing. The contribution is incremental, however, because it applies existing methods to new data.

The authors tackled the lack of named entity recognition (NER) resources for the Kyrgyz language by creating KyrgyzNER, a manually annotated dataset with 39,075 entity mentions from 1,499 news articles, and evaluated models including multilingual RoBERTa, which achieved a promising balance between precision and recall.

We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We describe our annotation scheme, discuss the challenges encountered in the annotation process, and present descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for evaluating Kyrgyz language processing pipelines.
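
The "balance between precision and recall" mentioned above is usually measured at the entity level for NER. The following is a minimal sketch of that computation, assuming BIO-style tags and exact-match scoring of a single sentence; the abstract does not state the tagging format or scoring script the authors used, and the labels `PER`, `LOC`, and `ORG` and the function names `bio_to_spans` and `prf` are illustrative placeholders rather than names taken from KyrgyzNER.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags.

    Assumption: a stray I- tag with no matching open entity starts a new
    entity (a lenient reading; stricter schemes would reject it).
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes an entity ending the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    tp = len(gold & pred)  # a span counts only if type and boundaries match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up tags, not data from the paper.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
scores = prf(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
print(scores)
```

In this toy sentence the model finds one of two gold entities and proposes one spurious entity, so precision and recall are both one half. Scoring a whole corpus this way would additionally require a sentence identifier in each span so that spans from different sentences do not collide.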

