CL AI May 9

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

arXiv:2605.08809
AI Analysis

This work addresses the inefficiency of representation learning in large-scale LLM pretraining, offering a practical regularization method that accelerates training and improves performance.

SimReg introduces an embedding similarity regularization loss for LLM pretraining that reduces intra-class variance and inter-class similarity, achieving over 30% faster training convergence and over 1% improvement in zero-shot downstream performance across dense and MoE architectures.

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefits in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
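The abstract describes the loss only in words, so the following is a minimal sketch of what a within-sequence similarity regularizer could look like, not the paper's actual formulation. It assumes a supervised-contrastive (InfoNCE-style) form over cosine similarities: for each token position, other positions in the same sequence that share its ground-truth next-token label count as positives, and all remaining positions act as negatives. The function name `sim_reg_loss`, the `temperature` parameter, and the weighting coefficient mentioned afterwards are hypothetical names chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sim_reg_loss(hidden, labels, temperature=0.1):
    """Assumed supervised-contrastive regularizer for ONE sequence.

    hidden: list of per-position representation vectors.
    labels: list of ground-truth next-token ids, one per position.
    Positions with no same-label partner in the sequence are skipped.
    """
    n = len(hidden)
    # Temperature-scaled pairwise cosine similarities.
    sims = [[cosine(hidden[i], hidden[j]) / temperature for j in range(n)]
            for i in range(n)]

    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over every other position.
        m = max(sims[i][j] for j in others)
        log_denom = m + math.log(sum(math.exp(sims[i][j] - m) for j in others))
        # Average negative log-probability of picking a same-label position.
        total += -sum(sims[i][p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Same-label tokens (ids 5 and 7) cluster together: the penalty is small.
clustered = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
# Same-label tokens point in different directions: the penalty is large.
scattered = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
token_ids = [5, 5, 7, 7]

loss_clustered = sim_reg_loss(clustered, token_ids)
loss_scattered = sim_reg_loss(scattered, token_ids)
```

Under this assumed form, the regularizer would be added to the usual next-token cross-entropy with a weighting coefficient (for example `total = ce_loss + lam * sim_reg_loss(hidden, labels)`, where `lam` is a hypothetical hyperparameter). Pulling same-label representations together while pushing different-label ones apart is one way to widen the classification margins that the abstract credits for the gains.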

