LG · Feb 2, 2021

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

Yuhan Liu, Saurabh Agarwal, Shivaram Venkataraman

arXiv:2102.01386v219.279 citations · Has Code

Originality: Highly original

AI Analysis

This work provides a significant speedup for fine-tuning large pre-trained models, which is a common practice in many machine learning domains, benefiting researchers and practitioners by reducing training time and cost.

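The authors tackle the slow fine-tuning of large pre-trained models such as BERT, which can take many hours even on modern GPUs. They develop AutoFreeze, a system that adaptively chooses which layers to train and caches intermediate activations, accelerating fine-tuning while preserving accuracy. With caching enabled, AutoFreeze speeds up fine-tuning on a single GPU by up to 2.55x. On a 64-GPU cluster, fine-tuning on the AG's news dataset, it achieves up to a 4.38x speedup when optimizing for end-to-end training time and a 5.03x reduction in total cost when optimizing for efficiency, without affecting model accuracy.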

With the rapid adoption of machine learning (ML), a number of domains now fine-tune models that were pre-trained on a large corpus of data. However, our experiments show that fine-tuning models like BERT can take many hours, even when using modern accelerators like GPUs. While prior work proposes limiting the number of layers that are fine-tuned, e.g., freezing all layers but the last, we find that such static approaches lead to reduced accuracy. We propose AutoFreeze, a system that uses an adaptive approach to choose which layers are trained, and show how this can accelerate model fine-tuning while preserving accuracy. We also develop mechanisms to enable efficient caching of intermediate activations, which can reduce the forward computation time during fine-tuning. We extend AutoFreeze to perform distributed fine-tuning and design two execution modes that minimize cost and running time, respectively. Our evaluation on ten NLP tasks shows that AutoFreeze, with caching enabled, can speed up fine-tuning on a single GPU by up to 2.55x. On a 64-GPU cluster, for fine-tuning on the AG's news dataset, AutoFreeze achieves up to a 4.38x speedup when optimizing for end-to-end training time and a 5.03x reduction in total cost when optimizing for efficiency, without affecting model accuracy.
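The two ideas in the abstract, freezing a leading run of layers and caching the activations they produce, can be illustrated with a toy sketch. This is not the AutoFreeze implementation: the abstract does not state the freezing criterion, so the per-block "change score" and the threshold rule below are assumptions, and `choose_frozen_prefix`, `CachedPrefixModel`, and the scalar "blocks" are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

# A "block" here is just a function on a float; a real model block
# would be a transformer layer operating on tensors.
Block = Callable[[float], float]


def choose_frozen_prefix(change_scores: Sequence[float], threshold: float) -> int:
    """Return how many leading blocks to freeze.

    ASSUMPTION: a block may be frozen when its change score falls below
    `threshold`, and blocks are frozen front to back, stopping at the
    first block that is still changing. The abstract does not specify
    the actual rule.
    """
    frozen = 0
    for score in change_scores:
        if score >= threshold:
            break
        frozen += 1
    return frozen


class CachedPrefixModel:
    """Toy model that caches the output of its frozen prefix per example."""

    def __init__(self, blocks: List[Block]) -> None:
        self.blocks = blocks
        self.num_frozen = 0
        self.cache: Dict[int, float] = {}
        self.block_calls = 0  # number of block evaluations, to show the saving

    def freeze(self, num_frozen: int) -> None:
        # Cached activations are only valid for one freeze boundary,
        # so drop them whenever the boundary moves.
        if num_frozen != self.num_frozen:
            self.num_frozen = num_frozen
            self.cache.clear()

    def forward(self, example_id: int, x: float) -> float:
        if example_id in self.cache:
            # Frozen blocks never change, so their output can be reused.
            h = self.cache[example_id]
        else:
            h = x
            for block in self.blocks[: self.num_frozen]:
                h = block(h)
                self.block_calls += 1
            self.cache[example_id] = h
        # Trainable blocks must always be recomputed.
        for block in self.blocks[self.num_frozen:]:
            h = block(h)
            self.block_calls += 1
        return h


if __name__ == "__main__":
    blocks = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3, lambda h: h * h]
    model = CachedPrefixModel(blocks)
    model.freeze(choose_frozen_prefix([0.01, 0.02, 0.5, 0.9], threshold=0.1))
    model.forward(example_id=0, x=1.0)  # first pass runs all four blocks
    model.forward(example_id=0, x=1.0)  # second pass skips the frozen prefix
    print(model.num_frozen, model.block_calls)
```

In this sketch, the second pass over an example evaluates only the unfrozen blocks, which is the forward-time saving the abstract attributes to caching. Freezing also removes backward computation for the frozen blocks, which the toy does not model.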

