LG AIMay 23, 2024

CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization

Zi Yang, Ziyue Liu, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, Zheng Zhang

arXiv:2405.14377v217.016 citationsh-index: 9Has CodeNIPS

Originality Incremental advance

AI Analysis

This addresses the problem of expensive and environmentally impactful AI training, making it more accessible beyond big tech companies, though it appears incremental as an optimization of existing tensor compression techniques.

The paper tackles the high computational and memory costs of training large AI models like LLMs by introducing CoMERA, a rank-adaptive tensor optimization method that achieves 2-3× speedup per training epoch compared to standard training and outperforms recent methods like GaLore in efficiency.

Training large AI models such as LLMs and DLRMs costs massive GPUs and computing time. The high training cost has become only affordable to big tech companies, meanwhile also causing increasing concerns about the environmental impact. This paper presents CoMERA, a Computing- and Memory-Efficient training method via Rank-Adaptive tensor optimization. CoMERA achieves rank-adaptive tensor-compressed (pre)-training via a multi-objective optimization formulation and improves the training to provide both a high compression ratio and excellent accuracy in the training process. Our optimized numerical computation (e.g., optimized tensorized embedding and tensor-network contractions) and GPU implementation eliminate part of the run-time overhead in the tensorized training on GPU. This leads to, for the first time, $2-3\times$ speedup per training epoch compared with standard training. CoMERA also outperforms the recent GaLore in terms of both memory and computing efficiency. Specifically, CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training. Our method also shows $\sim 2\times$ speedup than standard pre-training on a BERT-like code-generation LLM while achieving $4.23\times$ compression ratio in pre-training. With further HPC optimization, CoMERA may reduce the pre-training cost of many other LLMs. An implementation of CoMERA is available at https://github.com/ziyangjoy/CoMERA.

View on arXiv PDF Code

Similar