CL, LG | Jul 11, 2025

Lizard: An Efficient Linearization Framework for Large Language Models

arXiv:2507.09025v35 citations | h-index: 16
Originality: Highly original
AI Analysis

Lizard addresses efficiency problems for users of large language models by enabling faster, more memory-efficient inference on long sequences. It is a novel method rather than an incremental improvement.

The paper tackles the computational and memory bottlenecks that Transformer-based LLMs face on long sequences. It proposes Lizard, a linearization framework that converts these models into subquadratic architectures, recovers the teacher model's performance almost losslessly, and outperforms previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark.

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences because softmax attention has quadratic complexity and the Key-Value (KV) cache grows with context length, which makes inference memory-bound. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that resolves numerical instability in gated attention and thereby accelerates training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, outperforming previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
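To make the contrast between quadratic attention and a subquadratic recurrence concrete, here is a minimal sketch of a generic gated linear attention update. It is not Lizard's mechanism: the scalar per-token gate, the missing feature map and normalization, and the function names `gated_linear_attention` and `pairwise_reference` are all assumptions chosen for illustration. The recurrent form keeps a fixed-size state instead of a growing KV cache, and the pairwise form computes the same outputs by visiting every earlier token.

```python
import random


def gated_linear_attention(qs, ks, vs, gates):
    """Recurrent form: a fixed d_k x d_v state S, updated once per token.

    S_t = g_t * S_{t-1} + k_t v_t^T,   o_t = q_t^T S_t
    (hypothetical scalar gate g_t in (0, 1]; no softmax, no normalization)
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory, then write the new key-value outer product.
                S[i][j] = g * S[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def pairwise_reference(qs, ks, vs, gates):
    """Same outputs, computed by visiting every (query, earlier key) pair."""
    d_v = len(vs[0])
    outputs = []
    for t in range(len(qs)):
        out = [0.0] * d_v
        for s in range(t + 1):
            # Token s has been decayed by every gate applied after it.
            decay = 1.0
            for g in gates[s + 1 : t + 1]:
                decay *= g
            score = decay * sum(a * b for a, b in zip(qs[t], ks[s]))
            for j in range(d_v):
                out[j] += score * vs[s][j]
        outputs.append(out)
    return outputs


random.seed(0)
n, d_k, d_v = 6, 4, 3  # toy sizes, chosen arbitrarily
qs = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
ks = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
vs = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(n)]
gates = [random.uniform(0.5, 1.0) for _ in range(n)]

rec = gated_linear_attention(qs, ks, vs, gates)
ref = pairwise_reference(qs, ks, vs, gates)
max_err = max(abs(a - b) for r, f in zip(rec, ref) for a, b in zip(r, f))
print("recurrent and pairwise forms agree:", max_err < 1e-9)
```

In the recurrent form the memory footprint is the d_k x d_v state regardless of sequence length, whereas the pairwise form must keep every earlier key and value available. The gate controls how quickly old tokens fade from the state, which is the general idea behind adaptive memory control in gated attention.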

