LG AI Feb 11, 2025

TransMLA: Multi-Head Latent Attention Is All You Need

Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, Muhan Zhang

arXiv:2502.07864v5, 24 citations, h-index: 6, Has Code

Originality: Incremental advance

AI Analysis

TransMLA provides a practical way to migrate GQA-based models to the MLA structure so that they can use DeepSeek-specific optimizations, offering significant inference acceleration for users of such frameworks.

The paper converts GQA-based pre-trained models into MLA-based models for improved inference efficiency. On LLaMA-2-7B it compresses 93% of the KV cache and reports a 10.6x speedup at 8K context length while preserving meaningful output quality; fine-tuning on 6 billion tokens then restores performance on par with the original.

In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. Our approach enables direct compatibility with DeepSeek's codebase, allowing these models to fully leverage DeepSeek-specific optimizations such as vLLM and SGLang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Additionally, the model requires only 6 billion tokens for fine-tuning to regain performance on par with the original across multiple benchmarks. TransMLA offers a practical solution for migrating GQA-based models to the MLA structure. When combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be realized.
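To give a sense of scale for the 93% KV cache figure, here is a rough back-of-the-envelope sketch. It is not TransMLA itself and does not reproduce the paper's measurements. It assumes the publicly documented LLaMA-2-7B configuration (32 layers, 32 key/value heads, head dimension 128), 16-bit cache entries, batch size 1, and reads "8K" as 8192 tokens. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size in bytes of an uncompressed KV cache.

    The leading factor of 2 counts one key and one value vector
    per token, per head, per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Assumed LLaMA-2-7B shape (from the public model config, not from this page).
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)

# The abstract reports 93% of the KV cache compressed away, so 7% remains.
compressed = full * (1 - 0.93)

print(f"uncompressed KV cache: {full / 2**30:.2f} GiB")
print(f"after 93% compression: {compressed / 2**30:.2f} GiB")
```

Under these assumptions the uncompressed cache is 4 GiB per sequence at 8192 tokens, so a 93% reduction leaves roughly 0.28 GiB. The reported speedup depends on much more than cache size, so this arithmetic only illustrates the memory side.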

