LG AI CL | May 15, 2025

Superposition Yields Robust Neural Scaling

arXiv:2505.10465v3 | 26.424 citations | h-index: 7 | Has Code

Originality: Highly original

AI Analysis

This paper addresses a fundamental question: why do larger models perform better? The answer has implications for predicting and improving scaling laws in AI.

The paper investigates the origin of neural scaling laws in large language models, proposing that representation superposition (where models represent more features than dimensions) drives loss scaling inversely with model dimension, and confirms this behavior in open-sourced LLMs and Chinchilla scaling laws.

The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law, that loss decreases as a power law with model size, remains unclear. We propose that representation superposition, meaning that LLMs represent more features than they have dimensions, can be a key contributor to loss and cause neural scaling. Based on Anthropic's toy model, we use weight decay to control the degree of superposition, allowing us to systematically study how loss scales with model size. When superposition is weak, the loss follows a power law only if data feature frequencies are power-law distributed. In contrast, under strong superposition, the loss generically scales inversely with model dimension across a broad class of frequency distributions, due to geometric overlaps between representation vectors. We confirmed that open-sourced LLMs operate in the strong superposition regime and have loss scaling like one over the model dimension, and that the Chinchilla scaling laws are also consistent with this behavior. Our results identify representation superposition as a central driver of neural scaling laws, providing insights into questions like when neural scaling laws can be improved and when they will break down.
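The geometric claim in the abstract, that overlaps between representation vectors produce a loss scaling like one over the model dimension, can be illustrated with a small numerical sketch. The code below is not the paper's toy model or its released code. It assumes, purely for illustration, that feature representations behave like random unit vectors in m dimensions, for which the expected squared dot product between two distinct vectors is exactly 1/m. The function names and parameters are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(n_features, dim, seed=0):
    """Average squared dot product over all distinct pairs of feature vectors.

    Assumption: random unit vectors stand in for learned representations
    in the strong superposition regime (n_features larger than dim).
    """
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(n_features)]
    total, pairs = 0.0, 0
    for i in range(n_features):
        for j in range(i + 1, n_features):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Compare the measured overlap with the 1/m prediction as m grows.
    for dim in (8, 16, 32, 64):
        overlap = mean_squared_overlap(n_features=200, dim=dim)
        print(f"m={dim:3d}  mean squared overlap={overlap:.4f}  1/m={1 / dim:.4f}")
```

Under this assumption, doubling the dimension roughly halves the average interference between features, which is the kind of 1/m behavior the abstract attributes to strong superposition. Learned representations need not be random, so this only shows where such a scaling can come from geometrically.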
