Diwash Poudel

h-index: 7
Papers: 2

2 Papers

CV, Jun 3, 2025
TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by LLaMA-3.2-11B-Vision-Instruct. These modalities are fused using principal component analysis to form compact multi-modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches. These results highlight TriPSS' ability to capture complementary visual and semantic cues, establishing it as an effective solution for video summarization, retrieval, and large-scale multimedia understanding.
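The fusion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state feature dimensions, normalization, or the number of retained components, so the per-modality z-scoring, the 24- and 384-dimensional placeholder features, and `n_components=16` are all assumptions made for illustration (2048 is the usual pooled ResNet-50 feature size). Random arrays stand in for real frame features, and the helper names `zscore` and `pca_fuse` are hypothetical.

```python
import numpy as np

def zscore(x):
    """Standardize each feature column (assumed preprocessing, not from the paper)."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    return (x - mu) / sd

def pca_fuse(modalities, n_components):
    """Concatenate per-frame features from several modalities and project
    them onto the top principal components (PCA computed via SVD)."""
    stacked = np.hstack([zscore(m) for m in modalities])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Placeholder features for 40 frames; real inputs would come from CIELAB
# statistics, ResNet-50 embeddings, and encoded frame captions.
rng = np.random.default_rng(0)
n_frames = 40
perceptual = rng.normal(size=(n_frames, 24))    # hypothetical size
structural = rng.normal(size=(n_frames, 2048))  # pooled ResNet-50 size
semantic = rng.normal(size=(n_frames, 384))     # hypothetical size

fused = pca_fuse([perceptual, structural, semantic], n_components=16)
print(fused.shape)  # (40, 16)
```

In a pipeline like the one the abstract outlines, the rows of `fused` would then be passed to a density-based clusterer such as HDBSCAN to segment the video; that step is left out here to keep the sketch dependency-free.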

CV, Jun 23, 2025
PRISM: Perceptual Recognition for Identifying Standout Moments in Human-Centric Keyframe Extraction

Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

Online videos play a central role in shaping political discourse and amplifying cyber social threats such as misinformation, propaganda, and radicalization. Detecting the most impactful or "standout" moments in video content is crucial for content moderation, summarization, and forensic analysis. In this paper, we introduce PRISM (Perceptual Recognition for Identifying Standout Moments), a lightweight and perceptually-aligned framework for keyframe extraction. PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity. Unlike deep learning-based approaches, PRISM is interpretable, training-free, and computationally efficient, making it well suited for real-time and resource-constrained environments. We evaluate PRISM on four benchmark datasets: BBC, TVSum, SumMe, and ClipShots, and demonstrate that it achieves strong accuracy and fidelity while maintaining high compression ratios. These results highlight PRISM's effectiveness in both structured and unstructured video content, and its potential as a scalable tool for analyzing and moderating harmful or politically sensitive media in online platforms.
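To make the idea of perceptual color difference in CIELAB concrete, here is a minimal sketch of a training-free keyframe selector. The abstract does not specify which color difference metric or selection rule PRISM uses, so this example assumes the simple CIE76 distance (Euclidean distance in Lab), a per-frame mean over pixels, and a fixed threshold of 10; the function names `srgb_to_lab` and `select_keyframes` are hypothetical.

```python
import numpy as np

# Linear sRGB -> CIE XYZ matrix and D65 reference white.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])

def srgb_to_lab(rgb):
    """Convert a uint8 sRGB array of shape (..., 3) to CIELAB (D65)."""
    c = rgb.astype(np.float64) / 255.0
    # Undo the sRGB gamma encoding.
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    xyz = (linear @ _RGB_TO_XYZ.T) / _WHITE_D65
    d = 6.0 / 29.0
    f = np.where(xyz > d ** 3, np.cbrt(xyz), xyz / (3 * d ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def select_keyframes(frames, threshold=10.0):
    """Keep frame 0, plus every frame whose mean CIE76 difference from the
    previous frame exceeds the threshold (assumed rule, not from the paper)."""
    if len(frames) == 0:
        return []
    labs = [srgb_to_lab(f) for f in frames]
    keep = [0]
    for i in range(1, len(labs)):
        delta_e = np.linalg.norm(labs[i] - labs[i - 1], axis=-1).mean()
        if delta_e > threshold:
            keep.append(i)
    return keep

# Toy clip: three black frames followed by three white frames.
black = np.zeros((4, 4, 3), dtype=np.uint8)
white = np.full((4, 4, 3), 255, dtype=np.uint8)
frames = [black] * 3 + [white] * 3
print(select_keyframes(frames))  # [0, 3]
```

Because the selector is a fixed color-space transform plus a threshold, each chosen frame can be explained by a single number (its mean difference from the preceding frame), which is the kind of interpretability the abstract contrasts with deep learning-based approaches.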