CL · Sep 21, 2025

CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

arXiv:2509.16914v120 citations · h-index: 3 · Has Code · COLING
Originality: Synthesis-oriented
AI Analysis

The work addresses the scarcity of training data for low-resource languages such as Uyghur and Tibetan, enabling better NLP support for them. Its contribution is incremental, since the corpus is built with existing machine translation methods.

The paper tackles the problem of inadequate support for low-resource languages in large language models by constructing CUTE, a multilingual dataset including Chinese, Uyghur, Tibetan, and English, which enhances cross-lingual knowledge transfer and is the largest open-source corpus for Uyghur and Tibetan.

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.

