TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
This work addresses the lack of resources for Tibetan language modeling and promotes inclusivity in multilingual NLP, though the contribution is incremental in that it applies existing methods to new data.
The authors address the uneven progress of large language models on low-resource languages by creating TIB-STC, a large-scale Tibetan dataset of over 11 billion tokens, and demonstrate its utility by training a reference model that performs well on Tibetan-specific benchmarks.
The advancement of large language models (LLMs) has brought transformative capabilities to NLP, but this progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline of pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at https://github.com/Vicentvankor/sun-shine.
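The ordering of the three training stages named in the abstract can be illustrated with a minimal sketch. It performs no training and is not the authors' implementation: the `Stage` class, the `run_pipeline` helper, the checkpoint labels, and the `data_kind` descriptions are all hypothetical, introduced only to show that each stage starts from the checkpoint produced by the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str       # stage label, taken from the abstract
    data_kind: str  # ASSUMPTION: illustrative description of the data consumed


# Stage order as stated in the abstract; data_kind values are assumptions.
PIPELINE: List[Stage] = [
    Stage("pretraining", "raw multi-domain Tibetan text"),
    Stage("supervised fine-tuning", "instruction-response pairs"),
    Stage("preference optimization", "preferred/rejected response pairs"),
]


def run_pipeline(stages: List[Stage], initial_checkpoint: str) -> List[str]:
    """Record checkpoint lineage: each stage builds on the previous checkpoint.

    This is a placeholder for real training; it only derives new labels.
    """
    checkpoints = [initial_checkpoint]
    for stage in stages:
        # Hypothetical stand-in for a training run over stage.data_kind.
        checkpoints.append(f"{checkpoints[-1]} -> {stage.name}")
    return checkpoints


if __name__ == "__main__":
    for checkpoint in run_pipeline(PIPELINE, "base"):
        print(checkpoint)
```

In a real setup, each placeholder step would be replaced by a training run whose output checkpoint initializes the next stage.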