CV · Jun 12, 2025

MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

arXiv:2506.10430v16.23 citations · h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses the problem of capturing the semantic richness of videos for content creators and viewers, but the advance is incremental because it builds on existing multimodal fusion approaches.

The paper tackles video summarization by integrating visual and auditory information with MF2Summ, achieving competitive performance with F1-score improvements of 1.9% on SumMe and 0.6% on TVSum over DSNet.

The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
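The NMS step named in the abstract can be illustrated with a minimal sketch of generic one-dimensional (temporal) non-maximum suppression. This is not the authors' implementation: the `(start, end, score)` segment format, the function names `temporal_iou` and `temporal_nms`, the 0.5 IoU threshold, and the use of a single confidence score per segment are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical segment format: (start, end, score), with start < end.
Segment = Tuple[float, float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy NMS: keep the highest-scoring segment, drop heavy overlaps, repeat."""
    # Sort candidates by score, best first.
    remaining = sorted(segments, key=lambda s: s[2], reverse=True)
    kept: List[Segment] = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every candidate that overlaps the kept segment too much.
        remaining = [s for s in remaining if temporal_iou(best, s) <= iou_threshold]
    return kept


# Illustrative candidates (made-up values, not from the paper).
candidates = [(0, 10, 0.9), (2, 11, 0.8), (20, 30, 0.7), (21, 29, 0.6)]
selected = temporal_nms(candidates, iou_threshold=0.5)
print(selected)  # [(0, 10, 0.9), (20, 30, 0.7)]
```

In this toy input, the second and fourth candidates overlap a higher-scoring segment with IoU above 0.5 (8/11 and 8/10 respectively), so only two segments survive. How MF2Summ combines importance and center-ness into the score used for ranking is not stated in the abstract, so the sketch treats the score as given.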

