LG, Jun 25, 2024

Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

arXiv:2406.17660v1, 35 citations, Has Code
Originality: Highly original
AI Analysis

Grass addresses GPU memory bottlenecks in LLM training for researchers and practitioners, offering a novel approach with practical memory and throughput gains.

The paper tackles limited GPU memory in large language model training by proposing Grass, a method that uses sparse projections to reduce memory usage and computational cost. It enables half-precision pretraining of a 13B-parameter model on a single 40GB GPU and achieves up to a 2x throughput improvement on an 8-GPU system.

Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU--a feat infeasible for previous methods--and yields up to a $2\times$ throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS .
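
The idea of a sparse projection that yields structured sparse updates can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes the projection simply selects r rows of the gradient (chosen here by largest row norm), omits any scaling of the selected rows, and uses plain momentum SGD as the optimizer. The function names `top_r_rows`, `project`, and `sparse_momentum_step` are hypothetical. The point is only that the optimizer state has shape r x n instead of m x n, and that the weight update touches just the selected rows.

```python
import math


def top_r_rows(grad, r):
    """Return sorted indices of the r rows of `grad` with the largest L2 norm.

    Assumption: selecting by row norm is one possible rule, chosen for illustration.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in grad]
    order = sorted(range(len(grad)), key=lambda i: norms[i], reverse=True)
    return sorted(order[:r])


def project(grad, rows):
    """Apply a row-selection projection: keep only the chosen rows (r x n)."""
    return [list(grad[i]) for i in rows]


def sparse_momentum_step(weights, grad, rows, state, lr=0.1, beta=0.9):
    """One momentum-SGD step whose state lives in the projected r x n space.

    Only the selected rows of `weights` change, so the update is
    structured sparse. Momentum SGD is an assumption made for brevity.
    """
    g_proj = project(grad, rows)
    for k, i in enumerate(rows):
        for j in range(len(g_proj[k])):
            state[k][j] = beta * state[k][j] + g_proj[k][j]
            weights[i][j] -= lr * state[k][j]
    return weights, state


# Toy 4 x 2 weight matrix and gradient; keep r = 2 rows.
weights = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grad = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [6.0, 8.0]]
rows = top_r_rows(grad, r=2)
state = [[0.0, 0.0] for _ in rows]  # optimizer state is r x n, not m x n
weights, state = sparse_momentum_step(weights, grad, rows, state)
```

In this toy run the two high-norm rows are updated and the other two rows of `weights` are left untouched, which is where the savings in optimizer state, gradient memory, and communication would come from under these assumptions.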

Code Implementations: 1 repo