CL | AI | Mar 26, 2024

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

arXiv:2403.18058v2 | 57 citations | h-index: 28 | Has Code | NAACL
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of improving Chinese instruction tuning for LLM users, but its contribution is incremental: it adapts existing methods to a new linguistic domain.

The authors tackled the lack of high-quality Chinese instruction tuning datasets by introducing COIG-CQIA, a dataset derived from real-world resources and verified by humans. Models trained on it achieve highly competitive performance on diverse benchmarks.

Remarkable progress on English instruction tuning has improved the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
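Since the abstract points to a Hugging Face dataset, a minimal sketch of how one might pull a subset and turn a record into a prompt may be useful. Only the dataset ID comes from the URL above; the subset name passed to `load_subset`, the `instruction` / `input` / `output` field names, and the helper names themselves are assumptions to check against the dataset card, and the sample record is hypothetical.

```python
from typing import Any, Dict

DATASET_ID = "m-a-p/COIG-CQIA"  # taken from the URL in the abstract


def load_subset(subset: str, split: str = "train"):
    """Download one subset from the Hugging Face Hub.

    Needs `pip install datasets` and network access. The subset name is an
    assumption: look up the available configurations on the dataset card.
    """
    # Imported lazily so that to_prompt() below also works offline.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, subset, split=split)


def to_prompt(record: Dict[str, Any]) -> str:
    """Join the instruction with its optional context into one prompt string.

    The field names "instruction" and "input" are assumed, not confirmed
    by the text above.
    """
    instruction = (record.get("instruction") or "").strip()
    context = (record.get("input") or "").strip()
    return f"{instruction}\n\n{context}" if context else instruction


# Hypothetical record, used only to show the expected shape.
example = {"instruction": "Summarize the passage.", "input": "", "output": "..."}
print(to_prompt(example))
```

With network access, `load_subset("<subset-name>")` would return a dataset whose rows could be mapped through `to_prompt`; keeping the download separate from the formatting helper makes the formatting step easy to check without fetching any data.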

