CV | AI | MM | May 6, 2025

SD-VSum: A Method and Dataset for Script-Driven Video Summarization

arXiv:2505.03319v2 | 4 citations | h-index: 13 | MM
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating personalized video summaries from user-provided scripts. The advance is incremental: it builds on existing video summarization datasets and techniques.

The authors tackle script-driven video summarization by introducing a new dataset and a cross-modal attention network, and report that the network performs favorably against existing query-driven and generic summarization methods.

In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of a full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. We then extend a recently introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. This makes the dataset compatible with the introduced task, since the available triplets of "video, summary, and summary description" can be used to train a method that produces different summaries for a given video, each driven by the provided script describing the content of that summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum) that employs a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations show that SD-VSum performs favorably against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's content needs.
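To make the idea of script-conditioned selection concrete, the following is a minimal sketch in dependency-free Python, not the SD-VSum architecture itself. It assumes frame and script-sentence features are already extracted as equal-length vectors, and it replaces the learned projections, fusion, and scoring layers of a real network with plain scaled dot-product attention and a dot-product relevance score. All names (`ScriptDrivenSample`, `cross_attend`, `relevance_scores`, `select_summary`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptDrivenSample:
    """Hypothetical container for one 'video, summary, summary description' triplet."""
    frame_features: List[List[float]]   # one vector per video frame or segment
    script_features: List[List[float]]  # one vector per script sentence
    summary_labels: List[int]           # 1 if the frame belongs to the human summary


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(frames: List[List[float]],
                 sentences: List[List[float]]) -> List[List[float]]:
    """Each frame (query) attends over the script sentences (keys and values)."""
    dim = len(sentences[0])
    scale = math.sqrt(dim)
    contexts = []
    for frame in frames:
        weights = softmax([dot(frame, s) / scale for s in sentences])
        # Weighted sum of sentence vectors: the script context for this frame.
        context = [sum(w * s[d] for w, s in zip(weights, sentences))
                   for d in range(dim)]
        contexts.append(context)
    return contexts


def relevance_scores(frames: List[List[float]],
                     sentences: List[List[float]]) -> List[float]:
    """Assumed stand-in for a learned scoring head: frame-to-context similarity."""
    contexts = cross_attend(frames, sentences)
    return [dot(f, c) for f, c in zip(frames, contexts)]


def select_summary(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Calling `select_summary(relevance_scores(sample.frame_features, sample.script_features), k)` on a `ScriptDrivenSample` yields the indices of the frames most aligned with the script. Supplying a different script for the same video changes the attention contexts and therefore the selected frames, which is the behaviour the task definition asks for.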

Code Implementations: 1 repo
