KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP
This work addresses the shortage of NLP resources for Kyrgyz, a low-resource language, though the contribution is incremental: it applies an existing method to new data.
The authors tackle the lack of foundational NLP tools for Kyrgyz by introducing KyrgyzBERT, the first publicly available monolingual BERT-based model for the language. Fine-tuned on a new sentiment analysis benchmark, it achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger.
Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
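As a rough illustration of the reported metric, the sketch below computes an F1-score for a two-class sentiment task from gold and predicted labels. It is not the authors' evaluation code: the function name `binary_f1`, the label encoding (1 = positive, 0 = negative), and the toy labels are assumptions made for this example. The abstract also does not state whether the 0.8280 figure is a binary, macro, or weighted F1, so the binary variant shown here is only one possible reading.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Return the F1-score of the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # With no true positives, F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not from kyrgyz-sst2).
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
score = binary_f1(gold, pred)
print(f"F1 = {score:.4f}")
```

In practice, a library routine such as scikit-learn's `f1_score` would typically be used; the hand-written version only makes the definition explicit.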