CL, LG | Jun 2, 2025

Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?

CMU

arXiv:2506.02132v412.06 citations | h-index: 13 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work shows how modern language models encode linguistic information, a question relevant to NLP researchers and practitioners. The contribution is incremental, since it builds on earlier BERTology studies.

The paper analyzes 25 language models, from classical to modern, across eight linguistic tasks. It finds that hierarchical organization persists: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. Lexical information is linearly concentrated in early layers and becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible in all layers.

Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, we find that hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We dive deeper, conducting an in-depth multilingual analysis of two specific linguistic properties - lexical identity and inflectional morphology - that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing
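The contrast the abstract draws between "linearly accessible" and "nonlinear" information can be illustrated with a toy probing experiment. The sketch below is not the authors' code and uses no real model: `make_toy_layers` builds synthetic 2-d "activations" for two hypothetical layers, and every function name in it is an assumption made for illustration. A logistic-regression probe stands in for a linear probe, and a hand-built product feature stands in for a nonlinear probe such as a small MLP.

```python
import math
import random


def train_linear_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose thresholded probe output matches the label."""
    hits = 0
    for x, t in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (t == 1))
    return hits / len(X)


def make_toy_layers(n, seed=0):
    """Synthetic 2-d 'activations' for two hypothetical layers (an assumption).

    early: the label sets the sign of the first coordinate (linear encoding).
    deep:  the label sets the sign of the product of both coordinates
           (XOR-style encoding that no single hyperplane separates).
    """
    rng = random.Random(seed)
    early, deep, labels = [], [], []
    for _ in range(n):
        t = rng.randint(0, 1)
        s = 1.0 if t == 1 else -1.0
        a0 = rng.choice([-1.0, 1.0])
        noise = lambda: rng.uniform(-0.3, 0.3)
        early.append([s + noise(), noise()])
        deep.append([a0 + noise(), s * a0 + noise()])
        labels.append(t)
    return early, deep, labels


def probe(X, y, split):
    """Train on the first `split` examples, report accuracy on the rest."""
    w, b = train_linear_probe(X[:split], y[:split])
    return accuracy(w, b, X[split:], y[split:])


def run_demo(n=800, split=400):
    early, deep, labels = make_toy_layers(n)
    # A fixed product feature stands in for a nonlinear probe (e.g. an MLP).
    deep_nonlinear = [[a, b, a * b] for a, b in deep]
    return {
        "early_linear": probe(early, labels, split),
        "deep_linear": probe(deep, labels, split),
        "deep_nonlinear": probe(deep_nonlinear, labels, split),
    }


if __name__ == "__main__":
    for name, acc in run_demo().items():
        print(f"{name}: {acc:.2f}")
```

In this toy setup the linear probe recovers the label from the "early" layer but does much worse on the "deep" layer, while the probe with the extra product feature recovers it from both. A gap of this kind between linear and nonlinear probe accuracy is what "nonlinearly encoded" refers to. Probing a real model would replace `make_toy_layers` with per-layer hidden states extracted from that model.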
