LG · Jul 22, 2024

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar

arXiv:2407.15892v4 · 5 citations · h-index: 20

Originality: Incremental advance

AI Analysis

This work targets the memory bottleneck that researchers and practitioners face when training LLMs on long sequences. It is a practical, incremental improvement.

The paper tackles the high intermediate memory usage of training large language models on long sequences. It introduces Mini-Sequence Transformer (MsT), which partitions each input sequence and processes the resulting mini-sequences iteratively. The authors report no degradation in throughput or convergence for Llama3-8B with sequences 12x longer than standard implementations, and maximum context lengths extended by 12-24x for Qwen, Mistral, and Gemma-2.

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both forward and backward passes. In experiments with the Llama3-8B model, with MsT, we measure no degradation in throughput or convergence even with 12x longer sequences than standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
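The partitioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that the memory-heavy block is an output projection followed by cross-entropy, and the function names (`loss_full`, `loss_mini_sequence`) are hypothetical. It uses plain Python lists rather than a tensor library, and it does not show activation recomputation. The point is only that processing the sequence in chunks yields the same loss while holding fewer rows of intermediate logits at once.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def loss_full(hidden, weight, targets):
    """Baseline: materialize all seq_len x vocab logits, then average the loss."""
    logits = [[dot(h, w) for w in weight] for h in hidden]
    return sum(cross_entropy(l, t) for l, t in zip(logits, targets)) / len(targets)


def loss_mini_sequence(hidden, weight, targets, num_chunks):
    """Illustrative mini-sequence variant (hypothetical helper).

    Splits the sequence into at most num_chunks pieces and computes logits
    for one piece at a time, so only chunk x vocab logits exist at once.
    Returns the mean loss and the largest number of logit rows held.
    """
    seq_len = len(hidden)
    chunk = math.ceil(seq_len / num_chunks)
    total, peak_rows = 0.0, 0
    for start in range(0, seq_len, chunk):
        stop = start + chunk
        # Intermediate logits for this mini-sequence only.
        logits = [[dot(h, w) for w in weight] for h in hidden[start:stop]]
        peak_rows = max(peak_rows, len(logits))
        total += sum(
            cross_entropy(l, t) for l, t in zip(logits, targets[start:stop])
        )
    return total / seq_len, peak_rows
```

For a 10-token sequence and `num_chunks=4`, the chunked version holds at most 3 rows of logits at a time instead of 10, and the two functions agree on the loss up to floating-point rounding.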
