LG | CL | May 22, 2025

ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

arXiv:2505.17331v2 | 3 citations | h-index: 6 | EMNLP
AI Analysis

The paper addresses efficiency bottlenecks in training large language models and offers a scalable approach to faster, more cost-effective pretraining and finetuning. The contribution is incremental, since it builds on the existing LLaMA architecture.

The paper tackles the problem of improving training speed and inference throughput for LLaMA models. It introduces ECHO-LLaMA, which shares KV caches across certain layers to reduce KV computation, and reports up to 77% higher training throughput, up to 16% higher MFU, and up to 14% lower loss when trained on an equal number of tokens.

This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining their learning capacity. ECHO-LLaMA converts LLaMA models to use shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
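
To make the idea of sharing KV caches across layers concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical scheme in which layers are split into contiguous groups and only the first layer of each group computes K and V, with the other layers in the group reusing them. The abstract says only that sharing happens "across certain layers", so the grouping rule, the function names `kv_source_layers` and `kv_cache_bytes`, and the example sizes are all illustrative assumptions.

```python
def kv_source_layers(num_layers: int, group_size: int) -> list:
    """Map each layer index to the layer whose K/V tensors it reuses.

    Assumed scheme (not taken from the paper): contiguous groups of
    `group_size` layers, where the first layer in each group produces K/V.
    """
    if num_layers <= 0 or group_size <= 0:
        raise ValueError("num_layers and group_size must be positive")
    return [(layer // group_size) * group_size for layer in range(num_layers)]


def kv_cache_bytes(num_layers: int, group_size: int, seq_len: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence under the assumed scheme."""
    # Only layers that produce their own K/V need cache storage.
    producers = len(set(kv_source_layers(num_layers, group_size)))
    # Factor of 2 accounts for storing both the K and the V tensor.
    return 2 * producers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical sizes chosen only for illustration.
sources = kv_source_layers(num_layers=8, group_size=2)
baseline = kv_cache_bytes(8, 1, seq_len=2048, num_kv_heads=4, head_dim=64)
shared = kv_cache_bytes(8, 2, seq_len=2048, num_kv_heads=4, head_dim=64)
print(sources)
print(baseline, shared)
```

With `group_size=1` every layer produces its own K/V, which corresponds to the unshared baseline. Larger groups cut the number of K/V projections and the cache size in proportion to the number of producing layers. How this trades off against model quality is what the reported loss and throughput numbers measure.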

