LG | CL | May 22, 2025

ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu

arXiv:2505.17331v2 | 3 citations | h-index: 6 | EMNLP

AI Analysis

This work addresses efficiency bottlenecks in training large language models and offers a scalable approach to faster, more cost-effective pretraining and finetuning. The contribution is incremental, since it builds on the existing LLaMA architecture.

This paper tackles the problem of improving training speed and inference throughput for LLaMA models. It introduces ECHO-LLaMA, which shares KV caches across layers to reduce the cost of KV computation, resulting in up to 77% higher training throughput, up to 16% higher MFU, and up to 14% lower loss when trained on an equal number of tokens.

This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA models while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to share KV caches across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
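
To make the idea of sharing KV caches across layers concrete, the following is a minimal sketch that counts how many layers still compute their own K/V projections under a sharing plan. The abstract does not say which layers share or how they are grouped, so the grouping rule here (the layers from `first_shared` onward reuse K/V in consecutive groups of `group_size`) is an assumption for illustration only, and `make_plan` and `kv_cost` are hypothetical helper names, not part of ECHO-LLaMA.

```python
def make_plan(num_layers, first_shared, group_size):
    """Return owners[i]: the index of the layer whose K/V layer i reads.

    Assumed rule (not from the paper): layers below first_shared keep
    their own K/V; from first_shared on, each run of group_size
    consecutive layers reuses the K/V of the first layer in the run.
    """
    owners = []
    for i in range(num_layers):
        if i < first_shared:
            owners.append(i)
        else:
            group = (i - first_shared) // group_size
            owners.append(first_shared + group * group_size)
    return owners


def kv_cost(owners, seq_len, d_model, d_kv):
    """Rough K/V projection FLOPs and cache size for one sequence.

    Only layers that own their K/V pay for the two projections
    (K and V), each a (seq_len x d_model) by (d_model x d_kv) matmul
    costing about 2 * seq_len * d_model * d_kv FLOPs.
    """
    producers = len(set(owners))
    proj_flops = producers * 2 * (2 * seq_len * d_model * d_kv)
    cache_elems = producers * 2 * seq_len * d_kv
    return proj_flops, cache_elems


if __name__ == "__main__":
    baseline = make_plan(num_layers=8, first_shared=8, group_size=1)
    shared = make_plan(num_layers=8, first_shared=4, group_size=2)
    print("baseline owners:", baseline)
    print("shared owners:  ", shared)
    print("baseline cost:", kv_cost(baseline, seq_len=16, d_model=32, d_kv=32))
    print("shared cost:  ", kv_cost(shared, seq_len=16, d_model=32, d_kv=32))
```

In this toy plan, 6 of the 8 layers still produce K/V, so both the projection FLOPs and the cache size drop to 6/8 of the baseline. The count covers only the K/V projections; it says nothing about the throughput, MFU, or loss figures reported above, which depend on the full model and training setup.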
