Gemstones: A Model Suite for Multi-Faceted Scaling Laws
This work addresses the reproducibility and robustness of scaling laws, a concern for machine learning researchers. It is incremental in that it builds on existing scaling law studies.
The authors address the problem that scaling laws are overly sensitive to experimental design by creating Gemstones, an open-source dataset of over 4000 checkpoints from transformers with up to 2 billion parameters. They find that scaling law prescriptions vary significantly with architectural and hyperparameter choices.
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work, we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on the resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes, including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
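To make the sensitivity claim concrete, the following is a minimal sketch of how the choice of models included in a fit can change the fitted law. It is not the authors' fitting procedure. It assumes a simple power law, loss = a * compute^(-b), fit by least squares in log-log space, and it uses synthetic data for two hypothetical model shapes ("wide" and "deep"). All names, constants, and compute ranges are illustrative assumptions.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log-log space.

    Returns the pair (a, b). This is an illustrative functional form,
    not the one used in the paper.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free data (assumption): two hypothetical model shapes
# that follow different power laws over the same compute budgets.
compute = np.logspace(17, 20, 8)
loss_wide = 40.0 * compute ** -0.05
loss_deep = 90.0 * compute ** -0.07

# Fit on each subset alone, then on the pooled set of models.
a_wide, b_wide = fit_power_law(compute, loss_wide)
a_deep, b_deep = fit_power_law(compute, loss_deep)
a_all, b_all = fit_power_law(
    np.concatenate([compute, compute]),
    np.concatenate([loss_wide, loss_deep]),
)

# Extrapolate each fitted law to a larger, unseen compute budget.
target = 1e22
pred_wide = a_wide * target ** -b_wide
pred_deep = a_deep * target ** -b_deep
pred_all = a_all * target ** -b_all

print(f"exponent b: wide={b_wide:.3f} deep={b_deep:.3f} pooled={b_all:.3f}")
print(f"loss at 1e22: wide={pred_wide:.3f} deep={pred_deep:.3f} pooled={pred_all:.3f}")
```

In this toy setting, the three fits recover different exponents from the same compute range, so their extrapolations diverge as the target budget grows. Real checkpoints add noise and further hyperparameter effects, which this sketch deliberately leaves out.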