On Implications of Scaling Laws on Feature Superposition
This note addresses a foundational problem in understanding neural network representations, of interest to researchers in interpretability and theory, but its contribution is incremental because it builds on existing scaling law results.
This theoretical note uses scaling laws to argue that the superposition hypothesis, where sparse features are linearly represented, cannot be a complete theory of feature representation if features are universal across models with equal performance.
Using results from scaling laws, this theoretical note argues that the following two statements cannot both be true:
1. The superposition hypothesis, in which sparse features are linearly represented across a layer, is a complete theory of feature representation.
2. Features are universal, meaning that two models trained on the same data and achieving equal performance will learn identical features.
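To make the first statement concrete, the following is a minimal sketch of what "sparse features linearly represented across a layer" can look like when there are more features than dimensions. It is an illustration only, not the note's construction: the function name `superposition_demo`, the use of random unit vectors as feature directions, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def superposition_demo(n_features=64, n_dims=32, seed=0):
    """Embed one active feature out of n_features into n_dims < n_features
    dimensions and recover it with a linear readout.

    All names and sizes here are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    # Assumption: each feature gets a random unit-norm direction in the layer.
    W = rng.standard_normal((n_features, n_dims))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    # Sparse input: exactly one feature is active.
    active = int(rng.integers(n_features))
    f = np.zeros(n_features)
    f[active] = 1.0

    # Layer activation is a linear combination of feature directions.
    h = f @ W

    # Linear readout: project the activation onto every feature direction.
    scores = W @ h

    # The active feature scores 1; every other feature scores the cosine
    # similarity between two distinct directions, which is below 1.
    recovered = int(np.argmax(scores))
    return active, recovered

if __name__ == "__main__":
    active, recovered = superposition_demo()
    print("active feature:", active, "recovered feature:", recovered)
```

With a single active feature the readout is exact up to interference between non-orthogonal directions; as more features are active at once, that interference grows, which is why sparsity matters for the hypothesis.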