CV | CL | MM | Mar 16, 2025

AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

arXiv:2503.12559v2 | 48 citations | h-index: 9 | Has Code | ACL
Originality: Highly original
AI Analysis

This paper addresses the challenge of processing long videos for video-language understanding. It offers a flexible compression strategy that improves state-of-the-art models, though the contribution is incremental because it builds on existing redundancy reduction methods.

The paper tackles the limited context length of Multimodal Large Language Models (MLLMs) in video understanding. It proposes AdaReTaKe, a training-free method that adaptively reduces visual redundancy across time and layers, raising processing capacity from 256 to 2048 frames and outperforming existing methods by up to 6.0% on LVBench, the longest of the evaluated benchmarks.

Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
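
To make the idea of non-uniform compression concrete, here is a minimal sketch of allocating a fixed token budget across temporal chunks according to how redundant each chunk is. This is not the authors' algorithm: the redundancy measure (mean cosine similarity of adjacent frame features), the proportional allocation rule, and the names `chunk_redundancy` and `allocate_budget` are all assumptions made for illustration, and the layer-wise allocation described in the abstract is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def chunk_redundancy(frames):
    """Assumed redundancy score: mean similarity between adjacent frames."""
    if len(frames) < 2:
        return 0.0
    sims = [cosine(a, b) for a, b in zip(frames, frames[1:])]
    return sum(sims) / len(sims)


def allocate_budget(redundancies, total_budget, min_tokens=1):
    """Split total_budget across chunks in proportion to (1 - redundancy).

    Every chunk keeps at least min_tokens; rounding leftovers go to the
    chunks with the largest fractional share (largest-remainder rounding).
    """
    n = len(redundancies)
    spare = total_budget - min_tokens * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_tokens per chunk")
    # Small epsilon keeps the weights positive even for fully static chunks.
    weights = [max(1.0 - r, 0.0) + 1e-6 for r in redundancies]
    total = sum(weights)
    raw = [spare * w / total for w in weights]
    alloc = [min_tokens + int(x) for x in raw]
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Toy example: a static chunk (identical frames) and a dynamic one.
static_chunk = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
dynamic_chunk = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [chunk_redundancy(static_chunk), chunk_redundancy(dynamic_chunk)]
budget = allocate_budget(scores, total_budget=10)
print(scores, budget)
```

In this toy setup the static chunk keeps only the minimum number of tokens while the dynamic chunk receives nearly the whole budget, which is the behaviour a uniform compression ratio cannot express.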

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
