CV · Apr 6, 2025

Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

arXiv:2504.04572v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses retrieval from lengthy videos; the advance appears incremental because it builds on existing multimodal approaches.

The paper tackles the problem of retrieving lengthy videos by developing a multimodal framework combining visual and aural matching streams with subtitle-based segmentation, and introduces a new evaluation metric for long-video retrieval. Experiments on the YouCook2 benchmark show promising results.

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.
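The abstract does not spell out how the subtitle-based segmentation or the two-stage retrieval works, so the following is only a toy sketch under stated assumptions: stage one shortlists whole videos by their full transcript, stage two ranks subtitle-derived segments inside the shortlist, and bag-of-words cosine similarity stands in for whatever learned embeddings the paper uses. The names `segment_by_subtitles`, `two_stage_retrieve`, `max_gap`, and `top_videos` are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment_by_subtitles(subtitles, max_gap=2.0):
    """Merge (start, end, text) cues into segments.

    Assumption: a pause longer than max_gap seconds marks a segment boundary.
    """
    segments = []
    for start, end, text in sorted(subtitles):
        if segments and start - segments[-1]["end"] <= max_gap:
            segments[-1]["end"] = end
            segments[-1]["text"] += " " + text
        else:
            segments.append({"start": start, "end": end, "text": text})
    return segments


def two_stage_retrieve(query, videos, top_videos=2):
    """Return (score, video_id, start, end) tuples, best match first."""
    q = Counter(tokenize(query))

    # Stage 1 (coarse): score each video by its whole transcript, keep the best few.
    def video_score(vid):
        transcript = " ".join(text for _, _, text in videos[vid])
        return cosine(q, Counter(tokenize(transcript)))

    shortlist = sorted(videos, key=video_score, reverse=True)[:top_videos]

    # Stage 2 (fine): score subtitle segments inside the shortlisted videos only.
    results = []
    for vid in shortlist:
        for seg in segment_by_subtitles(videos[vid]):
            score = cosine(q, Counter(tokenize(seg["text"])))
            results.append((score, vid, seg["start"], seg["end"]))
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

Called with a dictionary that maps video IDs to lists of timed subtitle cues, `two_stage_retrieve` returns scored time spans. Restricting the fine-grained pass to a shortlist is what keeps the cost manageable when each video contains many segments.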

