CV · Nov 19, 2024

DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding

arXiv:2411.12355v2 · 11 citations · h-index: 6 · CVPR
Originality: Incremental advance
AI Analysis

The paper addresses memory-efficient video encoding for LLMs, but the advance is incremental: it builds on existing methods by adding dynamic frame selection.

The paper tackles the challenge of preserving visual and semantic information in long videos for LLM-based video understanding while reducing token count, and the proposed DynFocus method achieves competitive performance on five benchmarks.

The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and that the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding. Specifically, it consists of (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
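The token-budget trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's DPE or CCE module: the relevance scores, the top-k selection rule, and the per-frame token counts (64 for detailed frames, 4 for sketchy ones) are all hypothetical values chosen for illustration. It only shows how giving a few question-relevant frames a large token allowance and the rest a small one reduces the total count compared with encoding every frame in full detail.

```python
from typing import List


def select_focus_frames(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, in temporal order.

    Hypothetical stand-in for a learned frame selector.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def token_budget(num_frames: int, focus: List[int],
                 detail_tokens: int, sketch_tokens: int) -> int:
    """Total visual tokens when focus frames get detail_tokens each
    and every other frame gets sketch_tokens each."""
    n_focus = len(focus)
    return n_focus * detail_tokens + (num_frames - n_focus) * sketch_tokens


# Hypothetical per-frame relevance scores for a single question.
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.3, 0.7, 0.2]

focus = select_focus_frames(scores, k=3)
dynamic = token_budget(len(scores), focus, detail_tokens=64, sketch_tokens=4)
uniform = len(scores) * 64  # every frame encoded in full detail

print(focus, dynamic, uniform)  # [1, 3, 6] 212 512
```

Under these assumed numbers, the dynamic allocation uses 212 tokens where uniform detailed encoding would use 512. Because the scores depend on the question, a different question would select different focus frames under the same budget.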

