Question-Answering Dense Video Events
This paper addresses the challenge of enabling multimodal large language models (MLLMs) to comprehend and reason about multiple events in long videos, an incremental advance in video understanding.
The paper tackles question-answering on dense video events by introducing a new task and dataset, DeVE-QA, and proposing DeVi, a training-free MLLM approach that improves grounded QA (GQA) accuracy over existing MLLMs by 4.8% on DeVE-QA and 2.1% on NExT-GQA.
This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that features a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% for G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
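The abstract reports grounded QA accuracy but does not define it. As a rough illustration only, the sketch below scores a prediction as correct when the answer matches and the predicted time span overlaps the annotated span sufficiently. The overlap measure (intersection over prediction), the 0.5 threshold, and all names (`Prediction`, `Reference`, `iop`, `grounded_qa_accuracy`) are assumptions made for this example, not details taken from the paper or its code release.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Prediction:
    answer: str  # chosen answer, e.g. an option label
    span: Span   # predicted video moment supporting the answer


@dataclass
class Reference:
    answer: str  # ground-truth answer
    span: Span   # annotated video moment


def iop(pred: Span, gt: Span) -> float:
    """Intersection over prediction: share of the predicted span inside the annotated span."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_length = pred[1] - pred[0]
    return intersection / pred_length if pred_length > 0 else 0.0


def grounded_qa_accuracy(
    preds: List[Prediction],
    refs: List[Reference],
    threshold: float = 0.5,  # assumed overlap threshold, not from the source
) -> float:
    """Fraction of questions answered correctly AND grounded with IoP >= threshold."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    hits = sum(
        1
        for p, r in zip(preds, refs)
        if p.answer == r.answer and iop(p.span, r.span) >= threshold
    )
    return hits / len(refs)
```

Under this definition, a correct answer with a badly misplaced time span does not count, which is what separates grounded QA accuracy from plain QA accuracy.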