CL | Apr 14, 2025

Transferable text data distillation by trajectory matching

arXiv:2504.09818v2 | h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses the problem of reducing data size for LLM training, particularly in text generation tasks such as instruction tuning. The advance is incremental: it adapts data distillation from computer vision to NLP.

The paper tackles the high training cost of large language models by proposing a data distillation method that synthesizes a small number of text samples to match full-dataset performance. It reports better results than the SOTA data selection method LESS on the ARC-Easy and MMLU benchmarks.

In the realm of large language models (LLMs), growing model size brings higher training costs, and there is an urgent need to minimize the data size in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset, and it offers better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that learns pseudo prompt data based on trajectory matching and finds its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, the ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM architectures (i.e., OPT to Llama).
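
The two ideas named in the abstract, trajectory matching and nearest-neighbor ID lookup, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the normalized squared-distance form of the trajectory matching loss used in the computer vision literature, and it assumes Euclidean distance for the nearest-neighbor step. The function names, the toy embedding table, and all numeric values are hypothetical.

```python
import math


def trajectory_matching_loss(student_end, expert_start, expert_target):
    """Assumed loss form: squared distance between the student's final
    parameters and the expert's later checkpoint, normalized by how far
    the expert itself moved between its two checkpoints."""
    num = sum((s - t) ** 2 for s, t in zip(student_end, expert_target))
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


def nearest_token_ids(pseudo_embeddings, embedding_table):
    """Map each learned continuous vector to the ID of the closest
    vocabulary embedding (Euclidean distance, an assumption)."""
    ids = []
    for vec in pseudo_embeddings:
        best_id = min(
            range(len(embedding_table)),
            key=lambda i: math.dist(vec, embedding_table[i]),
        )
        ids.append(best_id)
    return ids


# Toy vocabulary: 4 tokens with 2-dimensional embeddings (hypothetical).
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Learned "pseudo prompt" vectors that do not coincide with any real token.
pseudo_prompt = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
token_ids = nearest_token_ids(pseudo_prompt, embedding_table)

# Student trained on distilled data lands halfway along the expert's path.
loss = trajectory_matching_loss(
    student_end=[0.5, 0.5],
    expert_start=[0.0, 0.0],
    expert_target=[1.0, 1.0],
)
```

One plausible reading of the design is that discrete token IDs correspond to ordinary text, which can be fed to a model with a different architecture, whereas continuous pseudo embeddings are tied to the embedding space of the model they were learned on.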

