A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
This is an incremental survey: it organizes existing research to help researchers and practitioners improve data utilization in LLM training, and it does not introduce new methods or results.
This paper presents a systematic survey on data-efficient post-training for Large Language Models (LLMs), addressing challenges like high annotation costs and diminishing returns from data scaling by proposing a taxonomy covering methods such as data selection, quality enhancement, and synthetic data generation.
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high cost of manual annotation and diminishing marginal returns from scaling data. Achieving data-efficient post-training has therefore become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category, examine the challenges they face, and highlight open problems and potential research directions. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM