Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
This work addresses the inclusivity gap in LLM advancements for Kazakh speakers, though the contribution is incremental, since it adapts an existing model to a new language setting.
The researchers address limited LLM support for Kazakh speakers by developing Sherkala-Chat, an instruction-tuned model adapted from LLaMA-3.1-8B. It significantly outperforms existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English.
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. To ensure effective and responsible alignment, we leverage translated instruction datasets, a Kazakhstan-specific instruction dataset that is automatically constructed and manually verified, and Kazakh-specific safety data. We release Sherkala-Chat (8B) as an open-weight model, along with a detailed description of its training, alignment, and evaluation, to support research and real-world applications for Kazakh speakers.