Learning curves theory for hierarchically compositional data with power-law distributed features
This work provides a theoretical foundation for understanding neural scaling laws, particularly in language and image tasks. Its contribution is incremental: it builds on and combines existing theories rather than introducing new methods.
The paper unifies two theories of neural scaling laws, one based on power-law distributed units and the other on hierarchically compositional data, by analyzing classification and next-token prediction tasks built on probabilistic context-free grammars. For classification, power-law distributed production rules yield a power-law learning curve whose exponent depends on the rules' distribution and whose large multiplicative constant depends on the hierarchical structure. For next-token prediction, the distribution of production rules affects the local details of the learning curve but not the exponent describing its large-scale behaviour.
Recent theories suggest that neural scaling laws arise whenever the task can be linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars, which are probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
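To make the data model concrete, the following is a minimal sketch of a hierarchical generative process with power-law distributed production rules. It is an illustration under stated assumptions, not the construction used in the paper: the parameter names (`v` for vocabulary size, `m` for rules per symbol, `s` for symbols per rule, `depth`, and the exponent `a`), the Zipf form p_k proportional to k^-(1+a), and the choice to draw productions uniformly at random (so two rules may coincide) are all hypothetical simplifications.

```python
import random


def zipf_weights(m, a):
    """Normalised power-law probabilities, p_k proportional to k^-(1+a), k = 1..m.
    The exponent parametrisation is an assumption made for this sketch."""
    raw = [k ** -(1.0 + a) for k in range(1, m + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def make_grammar(v, m, s, depth, rng):
    """Build one rule table per level: each of the v symbols gets m productions,
    each production being a tuple of s lower-level symbols drawn at random."""
    grammar = []
    for _ in range(depth):
        level = {
            sym: [tuple(rng.randrange(v) for _ in range(s)) for _ in range(m)]
            for sym in range(v)
        }
        grammar.append(level)
    return grammar


def sample(grammar, probs, root, rng):
    """Expand the root symbol level by level; at every node the production
    is picked according to the power-law probabilities in probs."""
    symbols = [root]
    for level in grammar:
        expanded = []
        for sym in symbols:
            rule = rng.choices(level[sym], weights=probs, k=1)[0]
            expanded.extend(rule)
        symbols = expanded
    return symbols


rng = random.Random(0)          # fixed seed for reproducibility
v, m, s, depth = 4, 3, 2, 3     # hypothetical toy sizes
probs = zipf_weights(m, a=1.0)
grammar = make_grammar(v, m, s, depth, rng)
label = rng.randrange(v)        # class label = root symbol
tokens = sample(grammar, probs, label, rng)
print(label, tokens)
```

In this sketch a classification example is the pair (`label`, `tokens`), while a next-token prediction example would use a prefix of `tokens` as context. Each sample has `s ** depth` tokens, and rules with a low rank `k` are used far more often than high-rank ones, which is the sense in which the production rules are power-law distributed.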