CV · May 14, 2024

Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video

arXiv:2405.08890v2 · 6 citations · h-index: 5 · MMAsia
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient and less subjective video summarization for users in multimedia and AI applications, though it is incremental as it builds on existing self-supervised and LLM-based approaches.

The paper proposes a self-supervised video summarization method that uses large language models to turn frame captions into a text summary, and it optimizes the model with a novel loss function that accounts for the diversity of the video. It reports state-of-the-art performance on the SumMe dataset, measured by rank correlation coefficients.

Current video summarization methods rely heavily on supervised computer vision techniques, which demand time-consuming and subjective manual annotations. To overcome these limitations, we investigated self-supervised video summarization. Inspired by the success of Large Language Models (LLMs), we explored the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task. By leveraging the strengths of LLMs in context understanding, we aim to enhance the effectiveness of self-supervised video summarization. Our method begins by generating captions for individual video frames, which LLMs then synthesize into a text summary. Subsequently, we measure the semantic distance between each caption and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are similar to the text summary. Our method achieves state-of-the-art performance on the SumMe dataset in terms of rank correlation coefficients. In addition, our method offers a novel capability: personalized summarization.
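
The final selection step described above (keep the frames whose captions are closest to the text summary) can be illustrated with a minimal sketch. This is not the paper's implementation: the captions and summary are hard-coded here rather than produced by a captioning model and an LLM, a plain bag-of-words cosine similarity stands in for the learned semantic matching, and the diversity-aware loss is not covered at all. The names `bag_of_words`, `cosine_similarity`, and `select_frames` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(captions, summary, k):
    """Score every caption against the summary and keep the k best frames.

    Returns the selected frame indices in temporal order, plus all scores.
    """
    summary_vec = bag_of_words(summary)
    scores = [cosine_similarity(bag_of_words(c), summary_vec) for c in captions]
    top = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top), scores


# Toy data (assumed): one caption per frame and a summary an LLM might produce.
captions = [
    "a man walks a dog in the park",
    "a close-up of a tree",
    "the dog catches a frisbee",
    "an empty street at night",
]
summary = "A man plays frisbee with his dog in the park."

selected, scores = select_frames(captions, summary, k=2)
print(selected)  # [0, 2]
```

Under the same assumptions, swapping in a different summary text (for example, one written to reflect a particular viewer's interests) changes which frames score highest, which is one way to picture the personalized summarization the abstract mentions.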

