The Power of Power Law: Asymmetry Enables Compositional Reasoning
For researchers training language models, the result challenges the common intuition that reweighting data toward a uniform distribution is the best way to learn long-tail skills.
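The paper shows that training on power-law-distributed data outperforms training on uniformly distributed data on compositional reasoning tasks such as state tracking and multi-step arithmetic. On a minimalist skill-composition task, it also proves that power-law training requires significantly less data, because power-law sampling induces a beneficial asymmetry in the loss landscape.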
Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves the pathological loss landscape. This asymmetry enables models to first acquire high-frequency skill compositions with low data complexity, and these compositions in turn serve as a stepping stone for efficiently learning rare long-tail skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
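To make the two sampling regimes concrete, here is a minimal Python sketch that builds a Zipf-style power-law distribution and a uniform distribution over a set of skills, then draws training samples from each. It is only an illustration under stated assumptions: the function names, the exponent alpha, the number of skills, and the sample count are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's skill-composition task.

```python
import random


def power_law_probs(num_skills, alpha=1.0):
    """Zipf-style probabilities: the skill at rank r gets weight 1 / r**alpha.

    alpha is an assumed exponent, not a value from the paper.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_skills + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform_probs(num_skills):
    """Every skill is equally likely."""
    return [1.0 / num_skills] * num_skills


def sample_skill_counts(probs, num_samples, seed=0):
    """Draw skill indices according to probs and count how often each appears."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    counts = [0] * len(probs)
    for skill in draws:
        counts[skill] += 1
    return counts


# Hypothetical setup: 10 skills, 1000 training samples per regime.
NUM_SKILLS = 10
NUM_SAMPLES = 1000

power_counts = sample_skill_counts(power_law_probs(NUM_SKILLS), NUM_SAMPLES)
uniform_counts = sample_skill_counts(uniform_probs(NUM_SKILLS), NUM_SAMPLES)

print("power-law counts:", power_counts)
print("uniform counts:  ", uniform_counts)
```

Under the power-law regime the head skills receive most of the samples while the tail skills are seen rarely. This is the asymmetry that the abstract argues lets a model learn frequent compositions first and then use them as a stepping stone toward the rare ones.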