CL · Feb 19, 2025

Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Nurkhan Laiyk, Daniil Orel, Rituraj Joshi, Maiya Goloburda, Yuxia Wang, Preslav Nakov, Fajri Koto

arXiv:2502.13647v2 · 10.98 citations · h-index: 64 · Has Code · ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of instruction-following data for low-resource languages such as Kazakh. The contribution is incremental, since it applies existing LLM-assisted methods to a new domain and language.

The paper tackles instruction tuning for low-resource languages by building a manually verified dataset of 10,600 Kazakh samples focused on government and cultural domains. Fine-tuning Qwen, Falcon, and Gemma on this dataset yields consistent performance improvements on both multiple-choice and generative tasks.

Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
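
As a rough illustration of the pipeline the abstract describes (LLM-generated samples, manual verification, then fine-tuning), the sketch below keeps only human-verified samples and renders them into training text. It is a minimal sketch under stated assumptions, not the authors' implementation: the `IFTSample` fields, the `verified` flag, and the prompt template are all hypothetical, and the sample contents are placeholders.

```python
from dataclasses import dataclass


@dataclass
class IFTSample:
    """Hypothetical record layout; the paper's actual schema may differ."""
    instruction: str
    response: str
    domain: str      # assumed labels, e.g. "government" or "culture"
    verified: bool   # assumed flag set by a human annotator


def keep_verified(samples):
    """Drop any sample that has not passed manual verification."""
    return [s for s in samples if s.verified]


def to_training_text(sample):
    """Render one sample with a generic (assumed) instruction template."""
    return (
        f"### Instruction:\n{sample.instruction}\n\n"
        f"### Response:\n{sample.response}"
    )


# Placeholder samples, not taken from the dataset.
samples = [
    IFTSample("placeholder question A", "placeholder answer A", "government", True),
    IFTSample("placeholder question B", "placeholder answer B", "culture", False),
]

train_texts = [to_training_text(s) for s in keep_verified(samples)]
```

In practice, texts like `train_texts` would be tokenized and passed to a fine-tuning library. The filter runs before rendering, so unverified samples never reach the training set.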
