CL Mar 28, 2025

EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen

arXiv:2503.22196v1 12.04 citations h-index: 5 ACL

Originality: Incremental advance

AI Analysis

This work addresses memory constraints for Transformer-based LLMs on edge devices, though it appears incremental because it builds on existing KV cache optimizations.

The paper addresses long-sequence processing on edge devices by introducing EdgeInfinite, a memory-efficient Transformer that integrates compressed memory through a trainable memory-gating module. It reports performance comparable to baseline Transformers on long-context benchmarks while reducing memory consumption and time to first token.

Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning of only a small fraction of the parameters, and enables selective activation of the memory-gating module to route between long- and short-context tasks. Experimental results show that EdgeInfinite achieves performance comparable to a baseline Transformer-based LLM on long-context benchmarks while optimizing memory consumption and time to first token.
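The abstract does not spell out how the compressed memory and the gate interact, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a linear-attention-style compressive memory (an ELU+1 feature map and a fixed-size matrix updated per segment) and a single scalar gate logit; the function names, the feature map, and the `use_memory` flag are all hypothetical choices made for illustration.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a positive feature map often used in linear attention.
    # Assumption for this sketch; the source does not name one.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(q, k, v):
    # Standard causal softmax attention within a single segment.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def update_memory(memory, norm, k, v):
    # Fold a segment's keys and values into a fixed-size memory.
    # memory: (d, d_v), norm: (d,). Neither grows with sequence length.
    fk = feature_map(k)
    return memory + fk.T @ v, norm + fk.sum(axis=0)

def read_memory(memory, norm, q, eps=1e-6):
    # Retrieve a value estimate for each query from the compressed memory.
    fq = feature_map(q)
    return (fq @ memory) / (fq @ norm + eps)[:, None]

def gated_segment_step(q, k, v, memory, norm, gate_logit, use_memory=True):
    # Hypothetical gating step: blend memory readout with local attention.
    # gate_logit stands in for the trainable gate parameter.
    local = local_attention(q, k, v)
    if not use_memory:
        # Short-context route: behave like an ordinary Transformer layer.
        return local, memory, norm
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    out = g * read_memory(memory, norm, q) + (1.0 - g) * local
    memory, norm = update_memory(memory, norm, k, v)
    return out, memory, norm

# Process four segments; the memory keeps the same shape throughout.
rng = np.random.default_rng(0)
d, seg_len = 8, 16
memory, norm = np.zeros((d, d)), np.zeros(d)
for _ in range(4):
    q, k, v = (rng.standard_normal((seg_len, d)) for _ in range(3))
    out, memory, norm = gated_segment_step(q, k, v, memory, norm, gate_logit=0.0)
```

In this sketch the state carried between segments is a `d x d` matrix plus a length-`d` vector, independent of how many tokens have been seen. That is the property that would let such a design avoid an ever-growing KV cache. Passing `use_memory=False` bypasses the gate entirely, which mirrors the selective activation for short-context tasks that the abstract mentions.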
