CV IR MM IV

Jun 3, 2025

TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

arXiv:2506.05395v28.41 citations

h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses video summarization and retrieval, offering an incremental improvement over prior multimodal approaches.

The paper introduces TriPSS, a tri-modal keyframe extraction framework for video summarization that integrates perceptual, structural, and semantic features, and reports state-of-the-art performance on the TVSum20 and SumMe benchmarks.

Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by LLaMA-3.2-11B-Vision-Instruct. These modalities are fused using principal component analysis to form compact multi-modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches. These results highlight TriPSS' ability to capture complementary visual and semantic cues, establishing it as an effective solution for video summarization, retrieval, and large-scale multimedia understanding.
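To make the perceptual modality concrete, the following is a minimal sketch of how a frame's pixels could be mapped from sRGB into the CIELAB color space and reduced to a small descriptor. The abstract does not say which CIELAB statistics TriPSS actually uses, so the mean-color descriptor and the function names `srgb_to_lab` and `mean_lab` are illustrative assumptions, not the authors' implementation. The conversion itself follows the standard sRGB to XYZ to CIELAB formulas with a D65 reference white.

```python
# Sketch only: standard sRGB -> CIELAB conversion (D65 white point).
# The per-frame "mean Lab" descriptor is an assumption for illustration;
# the abstract does not specify the exact perceptual feature.

D65_WHITE = (0.95047, 1.0, 1.08883)  # reference white in XYZ


def _srgb_to_linear(channel):
    """Undo the sRGB gamma curve for one 8-bit channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function from the CIELAB definition."""
    delta = 6 / 29
    if t > delta ** 3:
        return t ** (1 / 3)
    return t / (3 * delta ** 2) + 4 / 29


def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to a CIELAB (L, a, b) tuple."""
    rl, gl, bl = (_srgb_to_linear(v) for v in (r, g, b))
    # Linear sRGB -> CIE XYZ (D65)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx = _lab_f(x / D65_WHITE[0])
    fy = _lab_f(y / D65_WHITE[1])
    fz = _lab_f(z / D65_WHITE[2])
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def mean_lab(pixels):
    """Average CIELAB color over an iterable of (r, g, b) pixels."""
    labs = [srgb_to_lab(*p) for p in pixels]
    n = len(labs)
    return tuple(sum(lab[i] for lab in labs) / n for i in range(3))


if __name__ == "__main__":
    # A toy two-pixel "frame": one white pixel and one black pixel.
    print(mean_lab([(255, 255, 255), (0, 0, 0)]))
```

In a full pipeline of the kind the abstract describes, a descriptor like this would be one of three per-frame vectors (alongside a ResNet-50 embedding and a caption embedding) that are fused with PCA before clustering; those stages are not shown here.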
