CV, AI, Apr 27

QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

arXiv:2604.24052
AI Analysis

For researchers in video summarization, QEVA provides a practical evaluation metric that does not require reference summaries, addressing a key bottleneck in the field.

QEVA is a reference-free metric for evaluating video summaries using multimodal question answering, assessing coverage, factuality, and chronology. It achieves higher correlation with human judgments (Kendall's τ_b, τ_c, Spearman's ρ) than existing methods on a new benchmark of 800 summaries from 200 videos.

Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric that evaluates candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA correlates more strongly with human judgments than existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
