AI · Aug 16, 2025

QuickMerge++: Fast Token Merging with Autoregressive Prior

arXiv:2508.13204v1 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

QuickMerge addresses efficiency issues for users of large-scale generative models across language, vision, and video domains. It is an incremental improvement over prior token selection techniques.

The paper tackles the computational bottleneck of token-level processing in generative models by introducing QuickMerge, a dynamic token merging framework that reduces token counts while matching or exceeding the performance of existing methods.

As generative models scale to larger inputs across language, vision, and video domains, the cost of token-level computation has become a key bottleneck. While prior work suggests that only a subset of tokens significantly influences downstream predictions, most token selection methods are static, modality-specific, or incompatible with autoregressive generation. In this paper, we propose QuickMerge, a lightweight token merging framework designed for efficient next-token prediction. QuickMerge dynamically selects a reduced number of tokens based on attention norm magnitude, guided by an entropy-based budget estimator. To preserve autoregressive compatibility, we introduce a lightweight transformer prior trained over the merged token sequence. By combining semantic salience estimation, flexible token budgets, and AR alignment, QuickMerge enables accurate generation with fewer tokens. We evaluate QuickMerge across multiple modalities, demonstrating consistent improvements in compute-accuracy tradeoffs. Specifically, QuickMerge reduces token counts substantially while matching or exceeding the performance of learned tokenizers and fixed-patch baselines.
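The abstract does not give the formulas behind the entropy-based budget or the salience-driven selection, so the following is only a minimal sketch of how such a pipeline could look, not the paper's implementation. Everything in it is an assumption made for illustration: the names `entropy_budget` and `merge_tokens`, the use of a plain list of non-negative `salience` scores as a stand-in for attention norm magnitude, the rule that maps normalized entropy linearly to a token count, and the choice to average each dropped token into the positionally nearest kept token. The autoregressive prior is not modelled here.

```python
import math


def entropy_budget(salience, min_keep=1):
    """Hypothetical budget rule: scale the token count by normalized entropy.

    A peaked salience distribution (low entropy) yields a small budget;
    a flat one (high entropy) keeps close to all tokens.
    """
    n = len(salience)
    total = sum(salience)
    if n <= 1 or total <= 0:
        return n
    probs = [s / total for s in salience]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    ratio = entropy / math.log(n)  # 0 = one dominant token, 1 = uniform
    return max(min_keep, min(n, math.ceil(ratio * n)))


def merge_tokens(tokens, salience, budget):
    """Keep the `budget` most salient tokens and fold the rest into them.

    Each dropped token is merged into the kept token nearest in position
    (ties go to the earlier one), using a salience-weighted average.
    Returns the kept indices in sequence order and the merged vectors.
    """
    n = len(tokens)
    dim = len(tokens[0])
    ranked = sorted(range(n), key=lambda i: salience[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore original sequence order

    groups = {k: [] for k in keep}
    for i in range(n):
        nearest = min(keep, key=lambda k: abs(k - i))
        groups[nearest].append(i)

    merged = []
    for k in keep:
        members = groups[k]
        weight = sum(salience[i] for i in members)
        merged.append(
            [sum(salience[i] * tokens[i][d] for i in members) / weight
             for d in range(dim)]
        )
    return keep, merged


if __name__ == "__main__":
    toks = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    sal = [1.0, 3.0, 1.0, 3.0]
    k = entropy_budget(sal)
    print(k, merge_tokens(toks, sal, k))
```

In this sketch the kept indices stay in their original order, so the shortened sequence could still be fed to a left-to-right model. How QuickMerge itself aligns merged tokens with autoregressive generation is handled by its learned prior, which this example does not attempt to reproduce.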

