CV · Dec 12, 2024

Text-Video Multi-Grained Integration for Video Moment Montage

Zhihui Yin, Ye Ma, Xipeng Cao, Bo Wang, Quan Chen, Peng Jiang

arXiv:2412.09276v1 · 5.23 citations · h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the labor-intensive manual video editing faced by users of online short video platforms. The advance is incremental, as it builds on existing text-video alignment techniques.

The paper tackles the Video Moment Montage task, which automates video editing by locating video segments that match a narration text and assembling them into a finished video. It introduces the TV-MGI method, which aligns video content with textual descriptions at both the shot level and the frame level, and reports experiments against baseline methods on the new MSSD dataset.

The proliferation of online short video platforms has driven a surge in user demand for short video editing. However, manually selecting, cropping, and assembling raw footage into a coherent, high-quality video remains laborious and time-consuming. To accelerate this process, we focus on a user-friendly new task called Video Moment Montage (VMM), which aims to accurately locate the corresponding video segments based on a pre-provided narration text and then arrange these video clips to create a complete video that aligns with the corresponding descriptions. The challenge lies in extracting precise temporal segments while ensuring intra-sentence and inter-sentence context consistency, as a single script sentence may require trimming and assembling multiple video clips. To address this problem, we present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features, which enables the global and fine-grained alignment between the video content and the corresponding textual descriptions in the script. To facilitate further research in this area, we introduce the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct extensive experiments on the MSSD dataset to demonstrate the effectiveness of our framework compared to baseline methods.
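
To make the idea of combining shot-level and frame-level evidence concrete, the following is a minimal illustrative sketch, not the TV-MGI architecture itself. It assumes that each script sentence, each shot, and each frame has already been encoded as a plain vector, and it scores a sentence against a shot by mixing a coarse shot-level cosine similarity with the best frame-level similarity. The function names, the `alpha` weight, and the dictionary layout are all hypothetical, and picking a single shot per sentence is a simplification of the task described above, where one sentence may require several clips.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def multi_grained_score(text_vec, shot_vec, frame_vecs, alpha=0.5):
    """Mix a coarse (shot-level) and a fine (frame-level) similarity.

    alpha is a hypothetical mixing weight, not a parameter from the paper.
    """
    shot_sim = cosine(text_vec, shot_vec)
    # Fine-grained term: the best-matching frame inside the shot.
    frame_sim = max((cosine(text_vec, f) for f in frame_vecs), default=0.0)
    return alpha * shot_sim + (1 - alpha) * frame_sim


def assemble_montage(sentence_vecs, shots, alpha=0.5):
    """Return, for each sentence in script order, the index of its best shot.

    Simplification: one shot per sentence, chosen independently.
    """
    montage = []
    for sentence in sentence_vecs:
        scores = [
            multi_grained_score(sentence, s["shot_vec"], s["frame_vecs"], alpha)
            for s in shots
        ]
        montage.append(max(range(len(shots)), key=scores.__getitem__))
    return montage


# Toy 2-D features standing in for real encoder outputs.
shots = [
    {"shot_vec": [1.0, 0.0], "frame_vecs": [[1.0, 0.0], [0.9, 0.1]]},
    {"shot_vec": [0.0, 1.0], "frame_vecs": [[0.0, 1.0]]},
]
sentences = [[0.0, 1.0], [1.0, 0.0]]
order = assemble_montage(sentences, shots)
```

In this toy setup the first sentence matches the second shot and the second sentence matches the first, so `order` lists the shots in narration order rather than footage order. A learned model would replace the fixed cosine mix with trained fusion and would also predict trim boundaries inside each shot.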
