CV | AI | Sep 30, 2025

An Experimental Study on Generating Plausible Textual Explanations for Video Summarization

arXiv:2509.26225v1 | h-index: 13 | CBMI
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for explainable AI in video summarization by focusing on plausibility, but it is incremental as it builds on an existing framework and uses standard evaluation methods.

The study tackles the problem of generating plausible textual explanations for video summarization. It integrates a large multimodal model into an existing explanation framework and evaluates plausibility as the semantic overlap between the textual descriptions of the visual explanations and those of the video summaries. The experiments, run on two datasets with a state-of-the-art summarization method, examine whether more faithful explanations are also more plausible and identify the most appropriate approach for generating plausible textual explanations.

In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Next, we focus on one of the most desired characteristics of explainable AI: the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and to identify the most appropriate approach for generating plausible textual explanations for video summarization.
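
The plausibility evaluation described in the abstract compares the textual description of an explanation with the textual description of the summary in an embedding space. The following is a minimal sketch of that idea, not the authors' implementation. It assumes cosine similarity as the overlap measure, which the abstract does not state. The function names `toy_embed`, `cosine_similarity`, and `plausibility_score` are hypothetical, and `toy_embed` is a bag-of-words placeholder standing in for a real sentence encoder such as SBERT or SimCSE.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical placeholder embedding: sparse bag-of-words counts.

    This is NOT SBERT or SimCSE; it only keeps the sketch self-contained.
    A real study would swap in a sentence-embedding model here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    shared = a.keys() & b.keys()
    dot = sum(a[token] * b[token] for token in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty description shares nothing with the summary.
        return 0.0
    return dot / (norm_a * norm_b)


def plausibility_score(explanation_text, summary_text, embed=toy_embed):
    """Assumed plausibility proxy: similarity of the two descriptions."""
    return cosine_similarity(embed(explanation_text), embed(summary_text))


if __name__ == "__main__":
    # Made-up descriptions for illustration only, not data from the paper.
    summary = "a man rides a bike down a hill"
    explanation_a = "a man rides a bike"
    explanation_b = "a dog sleeps on a couch"
    print(plausibility_score(explanation_a, summary))
    print(plausibility_score(explanation_b, summary))
```

Under this sketch, an explanation whose description shares more content with the summary description receives a higher score. Passing a different `embed` callable is the intended extension point for comparing encoders, though a dense encoder would also need a dense-vector similarity function in place of the Counter-based one.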
