Multi-matrix Factorization Attention
This work addresses inference efficiency in large language models by reducing KV cache memory usage with minimal performance loss, though the contribution is incremental because it builds on existing attention variants.
The paper tackles the problem of maintaining strong performance under stringent Key-Value cache constraints in attention architectures, proposing Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR), which reduce KV cache usage by up to 56% and 93.7% while outperforming or matching existing methods.
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including state-of-the-art methods such as MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and the dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value cache through value projection re-parameterization. MFA's design enables strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits at the cost of a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
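To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes one plausible reading of the abstract: queries are produced by a shared low-rank down-projection followed by per-head up-projections, all heads attend over a single shared key/value head, and (for the key-reuse variant) values are derived from the cached keys through an extra projection. The function name `mfa_sketch`, all weight names, the shapes, and the random initialization are hypothetical and chosen only for illustration.

```python
import numpy as np


def mfa_sketch(x, n_heads=4, head_dim=8, rank=6, reuse_key=False, seed=0):
    """Illustrative attention with a low-rank factorized QK circuit.

    Assumptions (not taken from the source): a single shared key/value head,
    a shared query down-projection with per-head up-projections, and a
    key-reuse variant where values are a linear function of cached keys.
    Returns the per-head outputs and the number of cached scalars.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape

    # Query side: d_model -> rank (shared), then rank -> head_dim (per head).
    w_q_down = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
    w_q_up = rng.standard_normal((n_heads, rank, head_dim)) / np.sqrt(rank)

    # Key side: one head shared by every query head, so the cache size
    # does not grow with n_heads.
    w_k = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
    keys = x @ w_k  # (seq_len, head_dim), stored in the cache

    if reuse_key:
        # Hypothetical re-parameterization: values computed from cached keys,
        # so only the keys need to be stored.
        w_v_from_k = rng.standard_normal((head_dim, head_dim)) / np.sqrt(head_dim)
        values = keys @ w_v_from_k
        cached_scalars = keys.size
    else:
        w_v = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        values = x @ w_v
        cached_scalars = keys.size + values.size

    # Per-head queries: (n_heads, seq_len, head_dim).
    queries = np.einsum("sd,dr,hrc->hsc", x, w_q_down, w_q_up)

    # Scaled dot-product scores with a causal mask.
    scores = np.einsum("hsc,tc->hst", queries, keys) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = np.einsum("hst,tc->hsc", weights, values)
    return out, cached_scalars


x = np.random.default_rng(1).standard_normal((5, 16))
out, cached = mfa_sketch(x)
out_kr, cached_kr = mfa_sketch(x, reuse_key=True)
```

In this sketch the cached tensor sizes depend only on the sequence length and the shared head dimension, so adding query heads increases capacity without enlarging the cache, and the key-reuse variant stores half as many scalars as the base variant. The actual savings reported in the abstract depend on the paper's configurations, which this toy example does not reproduce.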