CV · Sep 16, 2025

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

arXiv:2509.13067v2 · 2 citations · h-index: 11
Originality: Incremental advance
AI Analysis

The paper addresses the computational overhead of high-resolution vision-language models and offers a practical solution for efficient inference. The advance is incremental, since it builds on the existing divide-and-conquer paradigm.

The paper tackles the computational inefficiency of high-resolution large vision-language models by proposing HERO, a framework that selectively drops visual tokens based on importance, achieving superior efficiency-accuracy trade-offs across benchmarks without training.

By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
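The two ideas named in the abstract, content-adaptive budget allocation across tiles and token selection within each tile, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the proportional allocation rule, and the single per-token score are all assumptions made for illustration. In particular, HERO's tile-importance estimate and its function-aware selection over the two CLS attention stages are not specified in the abstract, so a plain top-k over one hypothetical score list stands in for them here.

```python
from typing import List


def allocate_budgets(importance: List[float], total_budget: int) -> List[int]:
    """Split a token budget across tiles in proportion to tile importance.

    Hypothetical rule: proportional shares with largest-remainder rounding,
    so the per-tile budgets sum to total_budget.
    """
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    raw = [total_budget * w / total for w in importance]
    budgets = [int(r) for r in raw]  # floor of each proportional share
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the tiles with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order."""
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def drop_tokens(
    tile_scores: List[List[float]], importance: List[float], total_budget: int
) -> List[List[int]]:
    """For each tile, keep only the token indices that fit its allocated budget."""
    budgets = allocate_budgets(importance, total_budget)
    return [select_tokens(s, b) for s, b in zip(tile_scores, budgets)]


if __name__ == "__main__":
    # Three tiles with made-up per-token scores and made-up tile importance.
    scores = [[0.1, 0.9, 0.5, 0.7], [0.2, 0.8], [0.6, 0.4]]
    print(drop_tokens(scores, importance=[3, 1, 1], total_budget=6))
```

In this toy run the most important tile keeps all four of its tokens, while the other two tiles keep one token each. Because the dropping happens before the tokens reach the language model, a scheme of this shape needs no retraining, which matches the training-free claim in the abstract.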

