LG | May 21

ASAP: Attention Sink Anchored Pruning

arXiv:2605.22372
AI Analysis

For practitioners deploying ViTs on high-resolution inputs, ASAP provides a training-free token reduction method that outperforms prior approaches by leveraging the attention sink as a feature rather than a bug.

ASAP addresses the attention sink phenomenon in Vision Transformers, which causes uninformative tokens to be preserved over salient ones during token reduction. By modeling information flow as a Lazy Random Walk and using diffusion distance to the sink for clustering and pooling, ASAP achieves up to 48% throughput acceleration while maintaining or exceeding baseline accuracy across image, video, and vision-language tasks.

Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics, such as single-layer attention scores, that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate that ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining, or even exceeding, baseline accuracy.
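The abstract names the steps but not their formulas, so the following is only a minimal NumPy sketch of the general idea under stated assumptions, not the authors' implementation. Assumed here: the lazy step mixes the identity with each layer's attention using a hypothetical laziness parameter `alpha`; the sink is the token receiving the most probability mass in the cumulative matrix; "diffusion distance" is approximated by the plain L2 distance between rows of that matrix; tokens closest to the sink are treated as background, split into equal-size rings, and each ring is pooled with weights proportional to received mass. All function and parameter names are illustrative.

```python
import numpy as np

def row_softmax(logits):
    """Turn raw attention logits into a row-stochastic matrix."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cumulative_transition(attn_layers, alpha=0.5):
    """Compose one lazy random-walk step per layer.

    Assumed lazy step: stay on the current token with probability alpha,
    otherwise move according to that layer's attention row.
    """
    n = attn_layers[0].shape[0]
    T = np.eye(n)
    for attn in attn_layers:
        step = alpha * np.eye(n) + (1.0 - alpha) * attn
        T = step @ T  # a product of row-stochastic matrices stays row-stochastic
    return T

def sink_anchored_reduce(tokens, attn_layers, keep_ratio=0.5, n_rings=2, alpha=0.5):
    """Keep tokens far from the sink; pool the rest ring by ring (assumed rule)."""
    T = cumulative_transition(attn_layers, alpha)
    mass = T.sum(axis=0)                  # probability mass each token receives
    sink = int(mass.argmax())             # dominant accumulator of mass
    # Assumption: unweighted L2 distance between rows stands in for diffusion distance.
    dist = np.linalg.norm(T - T[sink], axis=1)

    order = np.argsort(-dist)             # farthest from the sink first
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(order[:n_keep])        # kept tokens, in original order
    rest = order[n_keep:]                 # nearest to the sink (includes the sink itself)

    pooled = []
    for ring in np.array_split(rest, n_rings):  # rings ordered by distance
        if ring.size == 0:
            continue
        w = mass[ring] / mass[ring].sum()       # weights from received mass
        pooled.append(w @ tokens[ring])         # one pooled token per ring

    reduced = np.vstack([tokens[keep]] + pooled) if pooled else tokens[keep]
    return reduced, sink, dist

# Usage on synthetic data: every token attends strongly to token 0.
rng = np.random.default_rng(0)
n_tokens, dim = 8, 4
layers = []
for _ in range(3):
    logits = rng.normal(size=(n_tokens, n_tokens))
    logits[:, 0] += 5.0                   # make token 0 behave like a sink
    layers.append(row_softmax(logits))
tokens = rng.normal(size=(n_tokens, dim))
reduced, sink, dist = sink_anchored_reduce(tokens, layers)
print(reduced.shape, sink)
```

Because the reduction is computed once from the cumulative matrix, this sketch mirrors the "single shot" framing of the abstract; how the paper actually chooses ring boundaries, treats the sink token itself, or weights the distance is not stated here and may differ.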

