LG · AI · DATA-AN · ML · Oct 3, 2023

A Neural Scaling Law from Lottery Ticket Ensembling

arXiv:2310.02258v2 · 4 citations · h-index: 17
AI Analysis

This paper offers insight into scaling phenomena in neural networks, with potential relevance to large language models and theoretical frameworks of learning. The contribution is incremental in that it builds on existing scaling-law theories.

The paper examines neural scaling laws by identifying a discrepancy between the predicted and observed scaling exponent on a simple 1D problem. It attributes the discrepancy to lottery ticket ensembling in wider networks, which yields a loss that scales as $N^{-1}$ instead of the predicted $N^{-4}$.

Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened up the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
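
The variance-reduction argument in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experiment and trains no network: it assumes each "lottery ticket" is an unbiased approximation of $y=x^2$ with independent Gaussian error of equal variance, and that the prediction is the plain average of the tickets. The names `ensemble_mse` and `fit_exponent`, and the parameters `noise_std` and `n_trials`, are hypothetical. Under these assumptions the MSE of the average should fall roughly as $1/n$ in the number of tickets $n$, which gives an exponent near 1 if $n$ grows in proportion to $N$.

```python
import numpy as np


def ensemble_mse(n_tickets, rng, n_trials=500, noise_std=0.1):
    """MSE of the average of n_tickets noisy approximations of y = x^2.

    Assumption (not from the paper): every ticket equals the target plus
    independent zero-mean Gaussian error with standard deviation noise_std.
    """
    x = np.linspace(-1.0, 1.0, 32)
    target = x ** 2
    # One independent error curve per (trial, ticket) pair.
    errors = rng.normal(0.0, noise_std, size=(n_trials, n_tickets, x.size))
    # Ensembling: average the tickets within each trial.
    prediction = target + errors.mean(axis=1)
    return float(np.mean((prediction - target) ** 2))


def fit_exponent(counts, losses):
    """Least-squares slope in log-log space, returned as alpha in loss ~ n^(-alpha)."""
    slope, _intercept = np.polyfit(np.log(counts), np.log(losses), 1)
    return -slope


rng = np.random.default_rng(0)
ticket_counts = np.array([1, 2, 4, 8, 16, 32, 64])
losses = np.array([ensemble_mse(n, rng) for n in ticket_counts])
alpha = fit_exponent(ticket_counts, losses)
print(f"fitted exponent alpha = {alpha:.2f}")
```

The fitted exponent sits near 1 only because the errors are independent and unbiased by construction. Correlated or biased tickets would flatten the curve, so the sketch shows the central-limit-style mechanism in isolation and says nothing about when real networks satisfy those conditions.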

