CV · Dec 5, 2025

ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

arXiv:2512.05385v1
Originality: Incremental advance
AI Analysis

This paper addresses inference acceleration for VLLMs. The advance is incremental because it builds on existing attention-based pruning methods.

The paper tackles the high computational load of Video Large Language Models (VLLMs) during pre-filling by proposing ShaRP, an improved attention-based pruning framework that integrates segment-aware causal masking, positional debiasing, and token deduplication. It reports competitive performance across multiple video understanding benchmarks without retraining.

Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
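To make the general idea concrete, here is a minimal sketch of attention-based visual token pruning followed by deduplication. It is not ShaRP's algorithm, which the abstract does not spell out: segment-aware causal masking and positional debiasing are omitted, and the function names, the mean-attention score, and the cosine-similarity threshold are all assumptions chosen for illustration.

```python
import math


def attention_scores(attn, num_visual):
    """Average attention each visual token receives from the text-query rows.

    attn: list of rows (one per text query token), each holding attention
    weights over the first `num_visual` key positions (the visual tokens).
    """
    scores = [0.0] * num_visual
    for row in attn:
        for j in range(num_visual):
            scores[j] += row[j]
    return [s / len(attn) for s in scores]


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens(attn, features, keep, sim_threshold=0.95):
    """Return indices of the visual tokens to keep, in their original order.

    Hypothetical selection rule: visit tokens from highest to lowest mean
    attention and skip any token that is a near-duplicate (cosine similarity
    above `sim_threshold`) of a token already selected.
    """
    scores = attention_scores(attn, len(features))
    ranked = sorted(range(len(features)), key=lambda j: -scores[j])
    kept = []
    for j in ranked:
        if len(kept) == keep:
            break
        if any(cosine(features[j], features[k]) > sim_threshold for k in kept):
            continue
        kept.append(j)
    # Restore temporal/spatial order so positions stay monotonic downstream.
    return sorted(kept)


# Toy input: 2 text-query rows attending over 4 visual tokens.
attn = [[0.4, 0.35, 0.05, 0.2],
        [0.5, 0.30, 0.10, 0.1]]
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 1.0]]
selected = prune_tokens(attn, features, keep=2)
```

In this toy input, token 1 has the second-highest attention but is nearly identical to token 0, so the deduplication step passes over it in favour of a less redundant token. Selection by attention alone would keep both near-duplicates.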

