CL | Oct 27, 2025

A Survey on LLM Mid-Training

Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai

arXiv:2510.23081v2 | 12 citations | h-index: 7

Originality: Synthesis-oriented

AI Analysis

This is an incremental survey paper that organizes and clarifies the emerging concept of mid-training for LLM researchers and practitioners.

This survey formally defines mid-training as a distinct stage between pre-training and post-training for large language models (LLMs), analyzing how it systematically enhances capabilities like mathematics and coding while maintaining foundational competencies. It provides a comprehensive taxonomy and optimization frameworks covering data curation and training strategies to support future LLM research.

Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
