CL | AI | Mar 26, 2024

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Junting Zhou, Ziqiang Liu, Feiteng Fang, Mingshan Chang, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang

arXiv:2403.18058v2 | 19.857 citations | h-index: 43 | Has Code | NAACL

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of improving Chinese instruction tuning for LLM users, but its contribution is incremental: it adapts existing methods to a new linguistic domain.

The authors tackled the lack of high-quality Chinese instruction tuning datasets by introducing COIG-CQIA, a dataset derived from real-world resources with human verification, which led to models achieving highly competitive performance in diverse benchmarks.

Remarkable progress in English instruction tuning has improved the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare the results against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
