CV, AI · Apr 27

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

arXiv:2604.24052

AI Analysis

For researchers in video summarization, QEVA provides a practical evaluation metric that does not require reference summaries, addressing a key bottleneck in the field.

QEVA is a reference-free metric for evaluating video summaries using multimodal question answering, assessing coverage, factuality, and chronology. It achieves higher correlation with human judgments (Kendall's τ_b, τ_c, Spearman's ρ) than existing methods on a new benchmark of 800 summaries from 200 videos.

Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $τ_b$, $τ_c$, and Spearman's $ρ$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
