CL, Oct 27, 2025

A Survey on LLM Mid-Training

arXiv:2510.23081v2, 12 citations, h-index: 7
Originality: Synthesis-oriented
AI Analysis

This is an incremental survey paper that organizes and clarifies the emerging concept of mid-training for LLM researchers and practitioners.

This survey formally defines mid-training as a distinct stage between pre-training and post-training for large language models (LLMs), analyzing how it systematically enhances capabilities like mathematics and coding while maintaining foundational competencies. It provides a comprehensive taxonomy and optimization frameworks covering data curation and training strategies to support future LLM research.

Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.

