LG, May 22, 2025

Small-to-Large Generalization: Data Influences Models Consistently Across Scale

arXiv:2505.16260v1, 1 citation, h-index: 54
Originality: Incremental advance
AI Analysis

This work addresses the challenge of extrapolating from small proxy models to large-scale settings for data attribution and dataset selection. It is an incremental advance aimed at improving efficiency.

The study examines how training data distribution influences model behavior across compute scales. It finds that small- and large-scale language model predictions generally correlate highly across data choices.

Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions are (generally) highly correlated across choices of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
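
The central claim, that small- and large-model predictions correlate across choices of training data, can be made concrete with a small sketch. It assumes each candidate training distribution yields one scalar loss per model on a fixed test set; the paper may measure agreement differently. The loss values and the names `small_model_loss` and `large_model_loss` are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# HYPOTHETICAL test losses, one entry per candidate training distribution.
# These numbers are illustrative only and do not come from the paper.
small_model_loss = [3.10, 2.95, 3.40, 2.80, 3.25]
large_model_loss = [2.20, 2.05, 2.45, 1.90, 2.40]

linear_agreement = pearson(small_model_loss, large_model_loss)
rank_agreement = spearman(small_model_loss, large_model_loss)
print(f"Pearson: {linear_agreement:.3f}  Spearman: {rank_agreement:.3f}")
```

A high rank correlation is what matters for dataset selection: if the proxy orders candidate datasets the same way the large model does, the dataset that looks best at small scale is also the best at large scale, even when absolute losses differ.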

