LG CL | Jan 27

Neural Neural Scaling Laws

Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho

arXiv:2601.19831v12.71 citations | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurately forecasting how model performance scales, a problem relevant to both researchers and practitioners. It is an incremental advance: it builds on existing scaling law frameworks and replaces their fixed functional forms with a data-driven predictor.

The paper tackles the problem of predicting language model performance on downstream tasks from validation perplexity. That approach has two weaknesses: averaging token-level losses obscures signal, and no simple parametric family captures the full range of scaling behaviors. The authors propose Neural Neural Scaling Laws (NeuNeu), a neural network that predicts future performance with 2.04% mean absolute error, a 38% reduction compared to logistic scaling laws (3.29% MAE).

Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
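The reported numbers and the argument against a single parametric family can be checked with a small sketch. This is not the authors' code: the four-parameter logistic form, the function names, and the toy compute grid are assumptions made for illustration. Only the two MAE values (3.29 and 2.04) come from the abstract.

```python
import math


def logistic_scaling_law(log_compute, floor, ceiling, slope, midpoint):
    """Hypothetical logistic form: accuracy moves from floor to ceiling in log-compute.

    The exact parameterization used by the paper's baseline is not given in the
    abstract; this is one common choice, assumed here for illustration.
    """
    return floor + (ceiling - floor) / (1.0 + math.exp(-slope * (log_compute - midpoint)))


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed accuracies."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# MAE values reported in the abstract, in percentage points.
reduction = relative_reduction(3.29, 2.04)
print(f"Relative MAE reduction: {reduction:.1%}")

# Toy compute grid (made up). With floor < ceiling and a positive slope the
# logistic curve never decreases, so it cannot follow a task whose accuracy
# degrades with scale.
log_compute = [18.0, 19.0, 20.0, 21.0, 22.0]
curve = [logistic_scaling_law(x, 0.25, 0.60, 1.2, 20.0) for x in log_compute]
is_monotone = all(a <= b for a, b in zip(curve, curve[1:]))
print("Logistic curve is non-decreasing:", is_monotone)
```

The first print confirms that going from 3.29% to 2.04% MAE is roughly a 38% relative reduction. The monotonicity check illustrates why a fixed functional form struggles with tasks that plateau early or get worse as compute grows.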
