AI CL LG Apr 24

The Power of Power Law: Asymmetry Enables Compositional Reasoning

arXiv:2604.22951
AI Analysis

For researchers training language models, this paper challenges the common intuition that reweighting data toward a uniform distribution is the best way to learn long-tail skills.

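The paper shows that training on power-law distributed data outperforms training on uniform data across compositional reasoning tasks such as state tracking and multi-step arithmetic. On a minimalist skill-composition task, it also proves that power-law training needs significantly less data, which the authors attribute to a beneficial asymmetry that power-law sampling induces in the loss landscape.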

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
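To make the two sampling regimes in the abstract concrete, the following is a minimal sketch of drawing training examples over a set of skills either uniformly or under a Zipf-style power law. It is an illustration only, not the paper's task or code: the function names, the exponent `alpha`, the skill count, and the sample size are all assumptions chosen for the example.

```python
import random
from collections import Counter


def power_law_weights(num_skills, alpha=1.0):
    """Unnormalized Zipf-style weights: the skill of rank r gets weight r**(-alpha).

    alpha is a hypothetical exponent chosen for illustration.
    """
    return [rank ** -alpha for rank in range(1, num_skills + 1)]


def sample_skills(num_skills, num_samples, alpha=None, seed=0):
    """Draw skill indices uniformly (alpha=None) or under a power law."""
    rng = random.Random(seed)
    skills = list(range(num_skills))
    weights = None if alpha is None else power_law_weights(num_skills, alpha)
    # random.choices samples with replacement; weights=None means uniform.
    return rng.choices(skills, weights=weights, k=num_samples)


# Assumed sizes: 100 skills, 10,000 training examples per regime.
uniform_counts = Counter(sample_skills(100, 10_000))
power_counts = Counter(sample_skills(100, 10_000, alpha=1.0))

# Under the power law, the head skill (index 0) is seen far more often
# than the tail skill (index 99); under uniform sampling they are similar.
print("uniform   head/tail:", uniform_counts[0], uniform_counts[99])
print("power law head/tail:", power_counts[0], power_counts[99])
```

In this sketch the power-law sampler concentrates most of the data budget on a few head skills, which is the kind of asymmetry the abstract describes as a stepping stone toward the rare tail skills.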
