LG CL Dec 29, 2025

Trellis: Learning to Compress Key-Value Memory in Attention Models

Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni

arXiv:2512.23852v1 16.97 citations h-index: 62

Originality: Highly original

AI Analysis

The paper addresses memory and efficiency bottlenecks in attention models for long-context applications, and it presents a novel method rather than an incremental improvement.

The paper tackles the quadratic computational complexity and growing key-value cache in Transformers by introducing Trellis, a novel architecture with bounded memory that dynamically compresses the cache at test time, achieving performance gains that increase with sequence length in tasks like language modeling and common-sense reasoning.

Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
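To make the idea of a fixed-size memory updated by online gradient descent with a forget gate more concrete, here is a minimal sketch in plain Python. It is not the Trellis architecture or its two-pass compression mechanism, which the abstract does not specify. It assumes a simple linear associative memory `M` trained on the loss 0.5 * ||M k - v||^2, with a scalar `forget` factor and learning rate `lr`. All function names, the loss, and the parameter values are hypothetical choices made for illustration.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def update_memory(M, k, v, lr=0.5, forget=0.99):
    """One online gradient step that writes the pair (k, v) into M.

    Assumed loss (illustrative only): 0.5 * ||M k - v||^2, whose gradient
    with respect to M is the outer product (M k - v) k^T. The old memory
    is first scaled by `forget`, then the gradient step is applied.
    The shape of M never changes, so memory stays bounded.
    """
    err = [p - t for p, t in zip(matvec(M, k), v)]  # prediction error M k - v
    return [
        [forget * m_ij - lr * e_i * k_j for m_ij, k_j in zip(row, k)]
        for row, e_i in zip(M, err)
    ]


def read_memory(M, q):
    """Retrieve the value associated with query q from the memory."""
    return matvec(M, q)


if __name__ == "__main__":
    d_k, d_v = 4, 2                           # key and value dimensions
    M = [[0.0] * d_k for _ in range(d_v)]     # fixed-size memory, d_v x d_k
    stream = [
        ([1.0, 0.0, 0.0, 0.0], [1.0, 2.0]),
        ([0.0, 1.0, 0.0, 0.0], [-1.0, 0.5]),
    ]
    for _ in range(50):                       # tokens arrive one at a time
        for k, v in stream:
            M = update_memory(M, k, v)
    print(read_memory(M, [1.0, 0.0, 0.0, 0.0]))
```

In this sketch the memory cost is independent of how many tokens have been processed, which is the property the abstract contrasts with a growing KV cache. A `forget` value below 1 shrinks older associations at every step, so recalled values are slightly attenuated; setting it to 1 disables forgetting.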
