CV | Jul 12, 2025

ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

arXiv:2507.09313v2 | 14 citations | h-index: 7 | Has Code
AI Analysis

This work addresses the need for better evaluation methods in multimodal dialogue systems, particularly for proactive interactions, though it is incremental as it focuses on benchmarking and metrics rather than new model capabilities.

The authors tackled the problem of evaluating proactive interactions in video large language models by introducing ProactiveVideoQA, the first comprehensive benchmark, and PAUC, a novel metric accounting for temporal dynamics. They showed that PAUC aligns better with human preferences than traditional metrics, providing a more accurate assessment of user experience.

With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveVideoQA

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
