CL AI | Aug 4, 2025

TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng

arXiv:2508.01977v1 | 9.63 citations | h-index: 17 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of low-resource language processing for Tibetan speakers, enabling more inclusive AI through dataset creation and model innovation.

The authors tackle data scarcity in Tibetan, a low-resource language, by introducing TIBSTC-CoT, a large-scale multi-domain dataset automatically constructed using chain-of-thought prompting with LLMs. They also develop the Sunshine-thinking LLM family, trained on this dataset, which shows strong reasoning and generation performance comparable to SOTA multilingual LLMs.

To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at https://github.com/Vicentvankor/sun-shine.
