LG · AI · Jun 27, 2024

Time Matters: Scaling Laws for Any Budget

arXiv:2406.18922v2 · 3 citations
Originality: Incremental advance
AI Analysis

This work targets the high cost of training large models by giving AI researchers and practitioners a way to make architectural decisions analytically. Its contribution is an incremental refinement of existing scaling-law approaches.

The paper tackles inaccurate training-time estimates for large models. FLOPs-based estimates predict wall-clock time poorly, so the authors introduce a proxy based on memory copies that estimates training speed from a model's hyperparameters. Combined with a Chinchilla-style scaling law, this proxy predicts the final loss. The analysis suggests that models should be wider rather than deeper, because the speed gained from width outweighs the benefits of depth.

A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor estimates, and construct a more accurate proxy based on memory copies. This allows us to accurately estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, this allows us to accurately predict the final loss of a model from a simple equation. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to analytically make architectural decisions and train models more efficiently. Crucially, this analysis predicts that in contrast to existing literature, models should be wider rather than deeper, as the benefits of speed outweigh the benefits of depth.
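To make the idea in the abstract concrete, here is a minimal sketch of how a time proxy and a Chinchilla-style curve could be combined under a fixed wall-clock budget. It is not the paper's equation, which this page does not reproduce. The loss constants are the parametric fit published for Chinchilla by Hoffmann et al. (2022). Everything else is an assumption made for illustration: the `12 * depth * width**2` parameter count, the `6 * N` FLOPs per token, the placeholder memory term proportional to `depth * width`, and the coefficients `c_flop` and `c_mem`, which are hypothetical and would have to be fitted to real hardware.

```python
from dataclasses import dataclass

# Chinchilla parametric fit L(N, D) = E + A / N**alpha + B / D**beta,
# constants as reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


@dataclass
class Config:
    depth: int  # number of transformer layers
    width: int  # model (hidden) dimension


def param_count(cfg: Config) -> int:
    # Common approximation for non-embedding transformer parameters.
    return 12 * cfg.depth * cfg.width ** 2


def seconds_per_token(cfg: Config, c_flop: float = 1e-14, c_mem: float = 1e-9) -> float:
    # HYPOTHETICAL time proxy: a FLOPs term plus a memory-copy term.
    # The memory term (depth * width) is a placeholder for per-token
    # activation traffic, NOT the paper's formula; c_flop and c_mem are
    # made-up coefficients that would need fitting to measured timings.
    flops = 6 * param_count(cfg)          # forward + backward FLOPs per token
    mem_copies = cfg.depth * cfg.width    # assumed activation elements moved
    return c_flop * flops + c_mem * mem_copies


def predicted_loss(cfg: Config, budget_seconds: float) -> float:
    # Tokens that fit in the wall-clock budget, then the Chinchilla curve.
    n = param_count(cfg)
    d = budget_seconds / seconds_per_token(cfg)
    return E + A / n ** ALPHA + B / d ** BETA


if __name__ == "__main__":
    budget = 24 * 3600.0  # one day of training, in seconds
    wide = Config(depth=6, width=2048)
    deep = Config(depth=24, width=1024)  # same parameter count as `wide`
    for name, cfg in [("wide", wide), ("deep", deep)]:
        print(name, param_count(cfg), round(predicted_loss(cfg, budget), 4))
```

In this toy setup the two configurations have the same parameter count, so the Chinchilla curve alone cannot tell them apart. The deeper model moves more data per token under the assumed memory term, sees fewer tokens within the budget, and ends at a higher predicted loss. That matches the direction of the abstract's claim only under these stated assumptions.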

