LG, Jun 4, 2024

A Study of Optimizations for Fine-tuning Large Language Models

arXiv:2406.02290v2, 12 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the hardware constraints users face when fine-tuning large language models, but its contribution is incremental: it synthesizes and compares existing optimizations rather than introducing new methods.

The study tackled the memory-intensive challenge of fine-tuning large language models by evaluating optimization techniques like Gradient Checkpointing and Low-Rank Adaptation, resulting in recommendations for balancing memory and runtime across diverse model sizes, including strategies for models with tens or hundreds of billions of parameters.

Fine-tuning large language models is a popular choice among users trying to adapt them for specific applications. However, fine-tuning these models is a demanding task because the user has to examine several factors, such as resource budget, runtime, model size, and context length, among others. A specific challenge is that fine-tuning is memory intensive, imposing constraints on the required hardware memory and on the context length of training data that can be handled. In this work, we share a detailed study of a variety of fine-tuning optimizations across different fine-tuning scenarios. In particular, we assess Gradient Checkpointing, Low-Rank Adaptation, DeepSpeed's Zero Redundancy Optimizer, and FlashAttention. With a focus on memory and runtime, we examine the impact of different optimization combinations on GPU memory usage and execution runtime during the fine-tuning phase. We provide our recommendation on the best default optimization for balancing memory and runtime across diverse model sizes. We share effective strategies for fine-tuning very large models with tens or hundreds of billions of parameters and for enabling large context lengths during fine-tuning. Furthermore, we propose appropriate optimization mixtures for fine-tuning under GPU resource limitations.
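To make the memory pressure described above concrete, here is a rough back-of-envelope sketch. It is not taken from the study. The function names are hypothetical, and the byte counts assume mixed-precision training with Adam (2 bytes per parameter for half-precision weights, 2 for gradients, 12 for full-precision optimizer states), with ZeRO stages 1, 2, and 3 partitioning optimizer states, gradients, and weights across GPUs in turn. Activation memory, which Gradient Checkpointing and FlashAttention target, is deliberately left out.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA trains two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the original weight matrix stays frozen.
    """
    return rank * (d_in + d_out)


def model_state_bytes_per_gpu(num_params: int, zero_stage: int = 0,
                              num_gpus: int = 1) -> float:
    """Approximate per-GPU bytes for weights, gradients, and optimizer states.

    Assumption (not from the study): mixed-precision Adam, i.e. 2 bytes for
    weights, 2 for gradients, and 12 for optimizer states per parameter.
    Activations and temporary buffers are ignored.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:   # stage 1 partitions optimizer states
        optim /= num_gpus
    if zero_stage >= 2:   # stage 2 also partitions gradients
        grads /= num_gpus
    if zero_stage >= 3:   # stage 3 also partitions the weights
        weights /= num_gpus
    return num_params * (weights + grads + optim)


if __name__ == "__main__":
    full = 4096 * 4096
    lora = lora_trainable_params(4096, 4096, rank=8)
    print(f"LoRA trains {lora} of {full} parameters in one 4096x4096 layer")

    n = 7_000_000_000  # a hypothetical 7B-parameter model
    for stage in (0, 1, 2, 3):
        gb = model_state_bytes_per_gpu(n, stage, num_gpus=8) / 1e9
        print(f"ZeRO stage {stage}, 8 GPUs: about {gb:.1f} GB per GPU")
```

Under these assumptions the model states alone of a 7B-parameter model exceed the memory of a single common GPU, which suggests why combinations of the optimizations listed above matter in practice.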

