CV · Jan 10, 2024

Large Model based Sequential Keyframe Extraction for Video Summarization

arXiv:2401.04962v1 · 27 citations · h-index: 4 · CMLDS
Originality: Incremental advance
AI Analysis

The paper offers an incremental improvement for video analysis applications, reporting better keyframe extraction performance than the competing methods it evaluates.

The paper tackles video summarization by extracting the fewest keyframes that still represent a video's semantics. It proposes LMSKE, which uses large models for shot segmentation and feature extraction, adaptive clustering for candidate selection, and redundancy elimination to prune the candidates. On the authors' benchmark, it outperforms SOTA competitors with an average F1 of 0.5311, fidelity of 0.8141, and compression ratio of 0.9922.

Keyframe extraction aims to sum up a video's semantics with the minimum number of its frames. This paper puts forward a Large Model based Sequential Keyframe Extraction method for video summarization, dubbed LMSKE, which contains three stages. First, we use the large model "TransNetV2" to cut the video into consecutive shots, and employ the large model "CLIP" to generate each frame's visual feature within each shot. Second, we develop an adaptive clustering algorithm to yield candidate keyframes for each shot, with each candidate keyframe located nearest to a cluster center. Third, we further reduce the above candidate keyframes via redundancy elimination within each shot, and finally concatenate them in accordance with the sequence of shots as the final sequential keyframes. To evaluate LMSKE, we curate a benchmark dataset and conduct extensive experiments, whose results show that LMSKE performs much better than several SOTA competitors, with an average F1 of 0.5311, average fidelity of 0.8141, and average compression ratio of 0.9922.
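The second and third stages can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes shot boundaries and per-frame features are already available (for example from TransNetV2 and CLIP), and it swaps in plain k-means with silhouette-based selection of the cluster count as a stand-in for the paper's adaptive clustering. The function names, `max_k`, and the cosine threshold `sim_threshold` are hypothetical choices made for illustration.

```python
import numpy as np

def _normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def _kmeans(x, k, iters=50, seed=0):
    # Plain Lloyd's k-means (stand-in, not the paper's algorithm).
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    labels = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels

def _silhouette(x, labels):
    # Mean silhouette coefficient; higher means better-separated clusters.
    ids = np.unique(labels)
    if len(ids) < 2:
        return -1.0
    d = np.linalg.norm(x[:, None] - x[None], axis=2)
    scores = []
    for i in range(len(x)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)
            continue
        a = d[i, same].sum() / (same.sum() - 1)
        b = min(d[i, labels == j].mean() for j in ids if j != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(scores))

def candidate_keyframes(features, max_k=5):
    # Stage 2 (assumed variant): choose k by silhouette score, then take
    # the frame nearest to each cluster center as a candidate.
    x = _normalize(np.asarray(features, dtype=float))
    if len(x) <= 2:
        return [0]
    best = (-np.inf, None, None)
    for k in range(2, min(max_k, len(x) - 1) + 1):
        centers, labels = _kmeans(x, k)
        score = _silhouette(x, labels)
        if score > best[0]:
            best = (score, centers, labels)
    _, centers, labels = best
    picks = []
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        dist = np.linalg.norm(x[members] - centers[j], axis=1)
        picks.append(int(members[dist.argmin()]))
    return sorted(picks)

def drop_redundant(features, candidates, sim_threshold=0.9):
    # Stage 3 (assumed rule): keep a candidate only if its cosine
    # similarity to every already-kept candidate is below the threshold.
    x = _normalize(np.asarray(features, dtype=float))
    kept = []
    for idx in candidates:
        if all(float(x[idx] @ x[j]) < sim_threshold for j in kept):
            kept.append(idx)
    return kept

def sequential_keyframes(shots, features, max_k=5, sim_threshold=0.9):
    # shots: (start, end) frame index pairs, end exclusive, in temporal order.
    out = []
    for start, end in shots:
        shot = features[start:end]
        kept = drop_redundant(shot, candidate_keyframes(shot, max_k), sim_threshold)
        out.extend(start + i for i in kept)
    return out

# Synthetic features: shot 1 shows two distinct scenes, shot 2 is static.
rng = np.random.default_rng(0)
def _frames(axis, n=10, dim=8):
    base = np.zeros(dim)
    base[axis] = 1.0
    return base + 0.01 * rng.normal(size=(n, dim))

feats = np.vstack([_frames(0), _frames(1), _frames(2)])
keyframes = sequential_keyframes([(0, 20), (20, 30)], feats)
```

On this toy input, clustering always proposes at least two candidates per shot, so the static second shot relies on the redundancy step to collapse its near-identical candidates into one. Processing shots in temporal order keeps the final keyframe indices sequential, mirroring the concatenation step described in the abstract.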
