Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval
This work addresses partially relevant video retrieval (PRVR), a domain-specific problem in video analysis, and contributes incremental improvements to attention mechanisms and knowledge distillation.
The paper tackles the retrieval of partially relevant segments from untrimmed videos by addressing two issues: the mismatch in information density between text and video, and the limits of existing attention mechanisms. The resulting model, KDC-Net, outperforms state-of-the-art methods on PRVR benchmarks, particularly under low moment-to-video ratios.
Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
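To make the idea of temporally local attention with relative positional encoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed window half-width and a simple additive bias indexed by relative offset, whereas the abstract describes adaptive windows. The function name `windowed_attention` and its parameters are hypothetical.

```python
import math


def windowed_attention(scores, window, rel_bias):
    """Row-wise softmax over clip-to-clip scores, restricted to a local window.

    scores:   T x T list of raw similarity scores between video clips
    window:   half-width of the temporal window (assumed fixed here;
              the abstract describes adaptive windows)
    rel_bias: dict mapping relative offset (j - i) to an additive bias
              (an assumed, simplified form of relative positional encoding)
    """
    T = len(scores)
    weights = []
    for i in range(T):
        # Clip i may only attend to clips within `window` steps of itself.
        lo, hi = max(0, i - window), min(T, i + window + 1)
        logits = [scores[i][j] + rel_bias.get(j - i, 0.0) for j in range(lo, hi)]
        # Numerically stable softmax over the in-window logits.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        row = [0.0] * T  # clips outside the window receive zero weight
        for k, j in enumerate(range(lo, hi)):
            row[j] = exps[k] / z
        weights.append(row)
    return weights


# Toy usage: five clips with uniform scores and no positional bias.
scores = [[0.0] * 5 for _ in range(5)]
attn = windowed_attention(scores, window=1, rel_bias={})
```

With uniform scores, each clip spreads its attention evenly over its in-window neighbours and ignores everything else. A learned bias or a per-clip window size would shift that distribution toward the clips most relevant to an event.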