CV · Apr 30, 2025

Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

arXiv:2504.21403v28.42 citations · h-index: 9 · Has Code · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses memory-efficiency and performance challenges in video question answering, though it is an incremental improvement over existing video compression methods.

The paper tackles the problem of inefficient token usage in video question answering by proposing a query-adaptive token selection strategy that balances static and dynamic information, achieving performance improvements of up to 5.8% on multiple benchmarks.

Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To alleviate this, existing works propose to compress video inputs, but often overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. We propose a novel token selection strategy, explore-then-select, that adaptively adjusts static and dynamic information based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. Then it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our framework is plug-and-play and can be seamlessly integrated within diverse video language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) on multiple video question answering benchmarks. Our code is available at https://github.com/ANDgate99/Explore-Then-Select.
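The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the candidate ratios, and the use of summed query-to-token attention as the selection metric are all assumptions made for illustration, and the attention scores are taken as given rather than computed by a model.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def explore_then_select(
    key_scores: Sequence[float],
    delta_scores: Sequence[float],
    budget: int,
    ratios: Sequence[float] = (0.25, 0.5, 0.75),
) -> Tuple[float, List[int], List[int]]:
    """Toy explore-then-select over key-frame and delta-frame tokens.

    key_scores / delta_scores: hypothetical query-to-token attention scores
    for static (key-frame) and dynamic (delta-frame) tokens.
    budget: total number of tokens to keep.
    ratios: candidate shares of the budget given to key-frame tokens
    (an assumed search space, not taken from the paper).
    """
    best = None
    for ratio in ratios:
        # Explore: split the budget between static and dynamic tokens.
        n_key = min(len(key_scores), int(budget * ratio))
        n_delta = min(len(delta_scores), budget - n_key)
        key_idx = top_k_indices(key_scores, n_key)
        delta_idx = top_k_indices(delta_scores, n_delta)
        # Select: score the candidate by the attention mass it retains
        # (an assumed stand-in for the query-aware metric).
        metric = sum(key_scores[i] for i in key_idx)
        metric += sum(delta_scores[i] for i in delta_idx)
        if best is None or metric > best[0]:
            best = (metric, ratio, key_idx, delta_idx)
    return best[1], best[2], best[3]


# A query that mostly attends to static detail favors key-frame tokens.
static_query = explore_then_select(
    key_scores=[0.9, 0.1, 0.8, 0.2],
    delta_scores=[0.05, 0.3, 0.1, 0.02],
    budget=4,
)

# A query that mostly attends to temporal change favors delta-frame tokens.
dynamic_query = explore_then_select(
    key_scores=[0.1, 0.1, 0.1, 0.1],
    delta_scores=[0.9, 0.8, 0.7, 0.05],
    budget=4,
)
```

In this sketch no model weights change: the same token pool is re-partitioned per query, which is the sense in which such a selection step could be plug-and-play.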
