LG | May 9

Predicting Large Model Test Losses with a Noisy Quadratic System

arXiv:2605.09154 | 86.6 | Has Code

AI Analysis

For researchers training large models, the paper offers a loss prediction method for choosing among compute, memory, and time trade-offs, positioned as a better alternative to heuristic-based scaling laws.

The paper introduces a predictive model for the pre-training loss of large models that handles changing batch sizes. It outperforms Chinchilla's loss model when projecting the loss to extrapolated compute budgets (up to 1000-fold), and in the authors' experiments it selects near-optimal configurations under resource constraints.

We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B), and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, which models the test loss using the model size and number of tokens, at projecting the loss to extrapolated compute budgets (up to 1000-fold). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints such as time, memory, and compute. In our experiments, the model-selected configurations are close to the ground-truth optimum. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available at https://github.com/chuningxdy/Noisy-Quadratic-System.
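
The configuration search described in the abstract can be sketched as a constrained grid search over (N, B, K). This is a minimal illustration, not the paper's method: the predictor below is a stand-in using the Chinchilla functional form with the constants published by Hoffmann et al. (2022), so it depends on B and K only through the token count D = B·K. The function names, the choice to measure B in tokens per update, the 6·N·D FLOP approximation, and the three constraint parameters are all assumptions made for the example.

```python
from itertools import product

# Stand-in predictor: Chinchilla form L = E + A / N^alpha + B / D^beta,
# with the constants reported by Hoffmann et al. (2022).
# This is NOT the paper's Noisy Quadratic System; it is a placeholder
# that lets the search loop below run end to end.
E, A, B_COEF, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(n_params, batch_tokens, steps):
    """Predict loss from N, B, K (assumption: B is counted in tokens per update)."""
    tokens = batch_tokens * steps  # D = B * K
    return E + A / n_params**ALPHA + B_COEF / tokens**BETA


def select_config(model_sizes, batch_sizes, step_counts,
                  max_flops, max_steps, max_params):
    """Return (loss, N, B, K) with the lowest predicted loss, or None if infeasible.

    Hypothetical constraints for illustration:
      compute: 6 * N * B * K <= max_flops (common training-FLOPs approximation)
      time:    K <= max_steps             (updates are sequential)
      memory:  N <= max_params
    """
    best = None
    for n, b, k in product(model_sizes, batch_sizes, step_counts):
        if 6 * n * b * k > max_flops or k > max_steps or n > max_params:
            continue  # violates at least one resource constraint
        loss = predicted_loss(n, b, k)
        if best is None or loss < best[0]:
            best = (loss, n, b, k)
    return best


if __name__ == "__main__":
    result = select_config(
        model_sizes=[1e8, 1e9],
        batch_sizes=[2**19, 2**21],
        step_counts=[1_000, 10_000],
        max_flops=1e20, max_steps=10_000, max_params=1e9,
    )
    print(result)
```

Swapping `predicted_loss` for a predictor that treats B and K separately, as the paper's model is described as doing, would leave the search loop unchanged. Only then would the time constraint on K trade off meaningfully against batch size.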
