CL · AI · Sep 21, 2025

Evolution of Concepts in Language Model Pre-Training

arXiv:2509.17196v1 · 7 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work incrementally advances understanding of the black-box pre-training process for language models by providing fine-grained tracking of representation dynamics.

The researchers tracked interpretable feature evolution during language model pre-training using crosscoders, finding that features form at specific points and complex patterns emerge later, with feature attribution showing causal connections to downstream performance.

Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
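The abstract names crosscoders but does not spell out their structure. The following is a minimal sketch, not the authors' implementation. It assumes the common crosscoder formulation in which one sparse latent vector is shared across all snapshots while each snapshot keeps its own encoder and decoder weights, so that the norm of a feature's decoder vector at each snapshot can be read as a proxy for how strongly that feature is present there. The names TinyCrosscoder, decoder_norms, and emergence_snapshot, and the 0.3 threshold, are hypothetical and chosen only for illustration; training is omitted.

```python
import math
import random


class TinyCrosscoder:
    """Illustrative crosscoder: one shared sparse latent, per-snapshot weights."""

    def __init__(self, n_snapshots, d_model, n_features, seed=0):
        rng = random.Random(seed)

        def init():
            # One weight matrix of shape (n_features, d_model), small random values.
            return [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                    for _ in range(n_features)]

        self.n_snapshots = n_snapshots
        self.d_model = d_model
        self.n_features = n_features
        self.W_enc = [init() for _ in range(n_snapshots)]  # one encoder per snapshot
        self.W_dec = [init() for _ in range(n_snapshots)]  # one decoder per snapshot
        self.b_enc = [0.0] * n_features

    def encode(self, acts):
        """acts[s] is the activation vector of the same token at snapshot s.

        Encoder contributions from all snapshots are summed, then passed
        through a ReLU, giving a single non-negative latent vector.
        """
        latent = []
        for f in range(self.n_features):
            pre = self.b_enc[f]
            for s in range(self.n_snapshots):
                pre += sum(w * a for w, a in zip(self.W_enc[s][f], acts[s]))
            latent.append(max(pre, 0.0))
        return latent

    def decode(self, latent):
        """Reconstruct each snapshot's activation from the shared latent."""
        recon = []
        for s in range(self.n_snapshots):
            out = [0.0] * self.d_model
            for f, z in enumerate(latent):
                if z:
                    for j, w in enumerate(self.W_dec[s][f]):
                        out[j] += z * w
            recon.append(out)
        return recon

    def decoder_norms(self, f):
        """L2 norm of feature f's decoder vector at every snapshot."""
        return [math.sqrt(sum(w * w for w in self.W_dec[s][f]))
                for s in range(self.n_snapshots)]


def emergence_snapshot(norms, frac=0.3):
    """Index of the first snapshot whose norm reaches frac * peak norm.

    The threshold is an assumption for illustration, not a value from the paper.
    """
    peak = max(norms)
    if peak == 0.0:
        return None
    for i, n in enumerate(norms):
        if n >= frac * peak:
            return i
    return None
```

With a trained model, calling emergence_snapshot(cc.decoder_norms(f)) for each feature f would give a rough per-feature estimate of when the feature begins to form. A real setup would first fit the weights, typically by minimising reconstruction error across snapshots together with a sparsity penalty on the latent.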
