LGAICCCLMar 14, 2025

Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding

arXiv:2503.11108v22 citationsh-index: 21
Originality Incremental advance
AI Analysis

This provides a theoretical foundation for developing more memory-efficient tensor attention transformer architectures, addressing a specific bottleneck in transformer inference.

The paper tackles the memory bottleneck of key-value (KV) cache in tensor transformer decoding by generalizing space complexity barriers to tensor attention and deducing memory lower bounds, while introducing two tensor attention cache types that reveal time-memory trade-offs.

The key-value (KV) cache in the tensor version of transformers presents a significant bottleneck during inference. While previous work analyzes the fundamental space complexity barriers in standard attention mechanisms [Haris and Onak, 2025], our work generalizes the space complexity barriers result to tensor attention version. Our theoretical contributions rely on a reduction from communication complexity and deduce the memory lower bound for tensor-structured attention mechanisms when $d = Ω(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory for two scenarios. Overall, our work provides a theoretical foundation for us to understand the time-memory tradeoff of KV-Cache compression in tensor attention decoding and offers more perspectives in developing more memory-efficient tensor attention Transformer architectures.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes