CV · Jun 30, 2024

Hierarchical Memory for Long Video QA

arXiv:2407.00603v2 · 10 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the computational expense of long video QA, a problem relevant to both researchers and practitioners, but its contribution is incremental: it fine-tunes an existing method (Flash-VStream).

The paper tackles the challenge of processing long video sequences for question-answering by compressing visual tokens to reduce memory footprint and decoding latency, and it achieved first place in the LOVEU Challenge @ CVPR'24, Track 1.

This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency while preserving the information essential for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further use the video and audio data of the MovieChat-1K training set to fine-tune the pretrained weights released by Flash-VStream, achieving 1st place in the challenge. Code is available at the project homepage: https://invinciblewyq.github.io/vstream-page
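The abstract does not spell out how STAR Memory compresses tokens, so the following is only a rough sketch of the general idea of holding a long video in a fixed-size memory. It is not the Flash-VStream implementation. The class name `BoundedFrameMemory`, the one-vector-per-frame representation, and the rule of merging the most similar pair of neighbouring slots are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BoundedFrameMemory:
    """Hypothetical fixed-capacity memory of per-frame features.

    Illustrative only: this is NOT STAR Memory. When the memory
    overflows, the two most similar neighbouring slots are merged
    into their weighted average, so memory use stays constant
    regardless of video length.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots = []  # each slot: (feature vector, number of frames merged)

    def add(self, feature):
        """Append one frame feature, compressing if over capacity."""
        self.slots.append((list(feature), 1))
        if len(self.slots) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self):
        # Find the adjacent pair of slots with the highest similarity.
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i][0], self.slots[i + 1][0]),
        )
        (feat_a, count_a), (feat_b, count_b) = self.slots[best], self.slots[best + 1]
        total = count_a + count_b
        # Weight by frame count so earlier merges are not diluted.
        merged = [(x * count_a + y * count_b) / total for x, y in zip(feat_a, feat_b)]
        self.slots[best:best + 2] = [(merged, total)]


memory = BoundedFrameMemory(capacity=2)
for frame_feature in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    memory.add(frame_feature)
```

In this toy run the two identical frames collapse into one slot, while the dissimilar third frame keeps its own slot. The slot count therefore never exceeds the capacity, however many frames are added.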

