CL · Jul 12, 2025

Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

arXiv:2507.09205v45 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses low-resource language modeling for Tibetan, to the benefit of Tibetan speakers and researchers. It is an incremental advance achieved through data curation and continued pre-training and post-training of an existing multilingual base model.

The authors tackled the underrepresentation of Tibetan in large language models by curating the largest Tibetan pre-training corpus to date and using it to enhance a multilingual base model, resulting in a model that consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across various tasks.

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we apply continual pre-training and post-training to a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
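The abstract does not describe the cleaning pipeline in detail, so the following is only a minimal sketch of what one stage of a Tibetan-oriented filter might look like. It assumes three steps that are common in corpus curation but are not confirmed by the paper: Unicode normalization, filtering by the share of characters in the Tibetan Unicode block (U+0F00 to U+0FFF), and exact-duplicate removal. The function names `tibetan_ratio` and `clean_corpus` and the thresholds `min_ratio` and `min_chars` are hypothetical.

```python
import hashlib
import unicodedata

# Tibetan Unicode block: U+0F00 to U+0FFF.
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(TIBETAN_START <= ord(c) <= TIBETAN_END for c in chars)
    return tibetan / len(chars)


def clean_corpus(docs, min_ratio=0.5, min_chars=10):
    """Normalize, filter by length and script ratio, and drop exact duplicates.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = unicodedata.normalize("NFC", doc).strip()
        # Skip documents that are too short or mostly non-Tibetan.
        if len(text) < min_chars or tibetan_ratio(text) < min_ratio:
            continue
        # Hash the normalized text so exact repeats are kept only once.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

A real pipeline would likely add near-duplicate detection and quality scoring on top of this; the sketch only illustrates why script-aware filtering is a natural first step for a low-resource language mixed into multilingual web data.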

