CL · Jan 26

Suppressing Final Layer Hidden State Jumps in Transformer Pretraining

arXiv:2601.18302v1 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses an internal-behavior problem in Transformer language models that is relevant to NLP researchers. It is an incremental advance because it builds on existing pretraining methods.

The paper tackles the disproportionately large jump in angular distance at the final layer of Transformer language models during pretraining. It proposes a jump-suppressing regularizer (JREG) that improves task performance at three sizes of Llama-based models without architectural changes.

This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors in the middle Transformer layers, despite a disproportionately large "jump" in the angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, and then demonstrate its prevalence across many open-weight models, as well as its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG) which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of three model sizes of Llama-based models, trained with the proposed JREG method, reveal improved task performance compared to the baseline without altering the model architecture.
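
To make the quantity in the abstract concrete, here is a minimal sketch of how the angular distance between each layer's input and output hidden states could be measured. It assumes the common definition of angular distance as arccos of the cosine similarity divided by pi, and it assumes the hidden states are already available as a NumPy array. The function names and `illustrative_jump_ratio` are hypothetical: the ratio is only an illustration of "final layer versus earlier layers", not the paper's jump-strength metric or the JREG loss, neither of which is specified here.

```python
import numpy as np


def angular_distance(x, y, eps=1e-12):
    """Angular distance in [0, 1] along the last axis (assumed definition):
    arccos(cosine similarity) / pi."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    cos = dot / np.maximum(norms, eps)
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi


def layer_distances(hidden_states):
    """hidden_states: array of shape (num_layers + 1, num_tokens, dim),
    where entry i is the input to layer i and entry i + 1 is its output.
    Returns the token-averaged angular distance for each layer."""
    return np.array([
        angular_distance(hidden_states[i], hidden_states[i + 1]).mean()
        for i in range(len(hidden_states) - 1)
    ])


def illustrative_jump_ratio(distances, eps=1e-12):
    """Hypothetical illustration only (NOT the paper's metric):
    final-layer distance divided by the mean distance of earlier layers."""
    return distances[-1] / (distances[:-1].mean() + eps)
```

Under these assumptions, a model with small distances in the middle layers and a large distance at the last layer would yield a ratio well above 1, which is the pattern the abstract describes as a jump.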

