LG AI CC CLMar 14, 2025

Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding

Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian

arXiv:2503.11108v27.12 citationsh-index: 21

Originality Incremental advance

AI Analysis

This provides a theoretical foundation for developing more memory-efficient tensor attention transformer architectures, addressing a specific bottleneck in transformer inference.

The paper tackles the memory bottleneck of key-value (KV) cache in tensor transformer decoding by generalizing space complexity barriers to tensor attention and deducing memory lower bounds, while introducing two tensor attention cache types that reveal time-memory trade-offs.

The key-value (KV) cache in the tensor version of transformers presents a significant bottleneck during inference. While previous work analyzes the fundamental space complexity barriers in standard attention mechanisms [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = Ω(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory for two scenarios. Overall, our work provides a theoretical foundation for us to understand the time-memory tradeoff of KV-Cache compression in tensor attention decoding and offers more perspectives in developing more memory-efficient tensor attention Transformer architectures.

View on arXiv PDF

Similar