ML | LG | Oct 11, 2024

Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

arXiv:2410.09005v1 | 6 citations | h-index: 2 | ICLR

Originality: Incremental advance

AI Analysis

This work provides incremental theoretical insights into scaling laws for researchers in machine learning theory, focusing on specific network architectures and data structures.

The paper addresses the theoretical understanding of neural scaling laws by analyzing two-layer neural networks trained on data whose covariance matrix has a power-law spectrum. It derives analytical expressions for the generalization error and shows a transition from exponential to power-law convergence in the specialized phase.

Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behaviors over multiple orders of magnitude. Despite their empirical observation, the theoretical understanding of these scaling laws remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and teacher are two-layer neural networks. Our study primarily focuses on the generalization error and its behavior in response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, exploring different learning regimes and identifying conditions under which power-law scaling emerges. Additionally, we extend our analysis to non-linear activation functions in the feature learning regime, investigating how power-law spectra in the data covariance matrix impact learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units, demonstrating how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.
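
To make the setting in the abstract concrete, the following is a minimal sketch of one-pass SGD in a student-teacher setup with a power-law data spectrum. It is not the paper's model or code: it assumes a single-output linear student and teacher rather than two-layer networks, a diagonal covariance with eigenvalues proportional to k^(-exponent) normalized to unit trace, and a noiseless teacher. The function names, the exponent, the learning rate, and the step count are all illustrative choices.

```python
import numpy as np


def power_law_spectrum(n_dims, exponent):
    """Eigenvalues proportional to k**(-exponent), normalized to unit trace (assumed form)."""
    k = np.arange(1, n_dims + 1, dtype=float)
    eigenvalues = k ** (-exponent)
    return eigenvalues / eigenvalues.sum()


def generalization_error(student, teacher, eigenvalues):
    """Expected squared error 0.5 * E[(student.x - teacher.x)^2] for x ~ N(0, diag(eigenvalues))."""
    diff = student - teacher
    return 0.5 * np.sum(eigenvalues * diff ** 2)


def one_pass_sgd(eigenvalues, n_steps, learning_rate, seed=0):
    """Run online SGD where every step uses one fresh sample that is never revisited."""
    rng = np.random.default_rng(seed)
    n_dims = eigenvalues.size
    teacher = rng.standard_normal(n_dims)  # hypothetical random teacher weights
    student = np.zeros(n_dims)             # student starts at zero
    errors = np.empty(n_steps + 1)
    errors[0] = generalization_error(student, teacher, eigenvalues)
    for step in range(1, n_steps + 1):
        # Fresh Gaussian input whose covariance is diag(eigenvalues).
        x = np.sqrt(eigenvalues) * rng.standard_normal(n_dims)
        residual = (student - teacher) @ x
        # Gradient step on the per-sample loss 0.5 * residual**2.
        student -= learning_rate * residual * x
        errors[step] = generalization_error(student, teacher, eigenvalues)
    return errors


if __name__ == "__main__":
    spectrum = power_law_spectrum(n_dims=50, exponent=1.5)
    errors = one_pass_sgd(spectrum, n_steps=2000, learning_rate=0.2)
    for step in (0, 10, 100, 1000, 2000):
        print(step, errors[step])
```

Plotting `errors` against the step index on log-log axes is one way to inspect how the decay depends on the spectrum exponent in this toy model. Directions with large eigenvalues are learned quickly and those with small eigenvalues slowly, which is the mechanism usually associated with power-law learning curves in linear models.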
