CVApr 28, 2025

Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction

arXiv:2504.19637v42 citationsh-index: 3
Originality Incremental advance
AI Analysis

This addresses a domain-specific challenge in video retrieval for applications like multimedia search, but it is incremental as it builds on existing PRVR methods.

The paper tackles the problem of Partially Relevant Video Retrieval (PRVR), where videos often contain irrelevant content to text queries, by proposing a framework that exploits inter-sample correlation and intra-sample redundancy with coherence prediction, achieving state-of-the-art results.

Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant moment features and distinguishing them from query-relevant moments, encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances discrimination of fine-grained moment-level semantics by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments demonstrate the superiority of our method, achieving state-of-the-art results.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes