CL · LG · Dec 2, 2024

Scaling Law for Language Models Training Considering Batch Size

arXiv:2412.01505v1 · 17 citations · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses hyperparameter tuning for LLM training, offering practical insights for researchers and engineers, but it is incremental as it builds on existing scaling law research.

The paper investigates how global batch size affects large language model training, establishing scaling laws for model size and data, and provides guidance for optimizing training under resource constraints.

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provide guidance for optimizing LLM training strategies under specific resource constraints.
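As a rough illustration of what fitting and extrapolating a batch size scaling law can look like, the sketch below fits a power law of the form B = a * C^b in log-log space and extrapolates it to a larger compute budget. The abstract does not state the functional form the authors use, so the power-law shape, the function name `fit_power_law`, and every number here are assumptions made for illustration; the data is synthetic and is not taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log space.

    Returns the pair (a, b). Hypothetical helper, not from the paper.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic (made-up) data: compute budgets in FLOPs and the batch size that
# a sweep might have found best at each budget. Generated from a known power
# law so the fit can be checked against it.
true_a, true_b = 0.5, 0.3
compute = [1e18, 1e19, 1e20, 1e21]
best_batch = [true_a * c ** true_b for c in compute]

a, b = fit_power_law(compute, best_batch)

# Extrapolate the fitted law to a budget outside the fitted range.
predicted_batch = a * 1e22 ** b
print(f"fitted a={a:.4f}, b={b:.4f}, predicted batch at 1e22 FLOPs={predicted_batch:.1f}")
```

In practice the points would come from batch size and learning rate sweeps at several scales, and the extrapolated prediction would then be compared against a run at the larger scale, which is the kind of validation the abstract describes.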

