An Empirical Study of Scaling Laws for Transfer
The study provides a principled way to measure transfer learning efficiency, which can help practitioners optimize data allocation strategies.
The paper studies scaling laws for transfer learning in transformers and finds that a "transfer gap" term varies significantly across datasets and determines whether collecting pre-training data or fine-tuning data is the more cost-effective way to improve downstream performance.
We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost-effective. Fitting the scaling law to experiments from diverse datasets reveals significant variation in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies, and it highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and to understand how data availability affects capabilities.
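The trade-off described in the abstract can be made concrete with a small numerical sketch. The abstract does not state the scaling law itself, so the functional form below, L(p, f) = (A * p^(-alpha) + G) * f^(-beta) + E, is an assumption chosen only because it contains a transfer gap term G; the names `transfer_loss` and `finetune_to_pretrain_gain_ratio` and all constants are hypothetical and are not fitted values from the paper.

```python
# Illustrative only: the functional form and every constant are assumptions,
# not values or equations taken from the paper.

def transfer_loss(p, f, A=10.0, alpha=0.3, G=0.5, beta=0.2, E=1.0):
    """Assumed downstream loss after pre-training on p tokens and
    fine-tuning on f tokens. G plays the role of the transfer gap and
    E the role of an irreducible loss."""
    return (A * p ** (-alpha) + G) * f ** (-beta) + E


def finetune_to_pretrain_gain_ratio(p, f, G, factor=10.0):
    """Loss reduction from `factor` times more fine-tuning data, divided by
    the loss reduction from `factor` times more pre-training data."""
    base = transfer_loss(p, f, G=G)
    pretrain_gain = base - transfer_loss(p * factor, f, G=G)
    finetune_gain = base - transfer_loss(p, f * factor, G=G)
    return finetune_gain / pretrain_gain


if __name__ == "__main__":
    p, f = 1e9, 1e6  # hypothetical pre-training and fine-tuning dataset sizes
    for G in (0.01, 1.0):  # a low and a high transfer gap
        ratio = finetune_to_pretrain_gain_ratio(p, f, G)
        print(f"G={G}: fine-tuning gain / pre-training gain = {ratio:.2f}")
```

Under this assumed form, the benefit of extra pre-training data does not depend on G, while the benefit of extra fine-tuning data grows with G, so the ratio rises as the gap widens. The form also implies that no amount of pre-training data can push the loss below G * f^(-beta) + E, which is one way to read the claim that scarce downstream data can bottleneck performance.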