CV | Nov 15, 2023

Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding

arXiv:2311.08835v4 | 33 citations | h-index: 9 | Has Code
Originality: Incremental advance
AI Analysis

This work improves video temporal grounding for applications such as video search and summarization. It is an incremental advance: it builds on existing transformer-based methods and adds targeted enhancements.

The paper addresses a weakness in video temporal grounding: existing methods treat all video clips equally, regardless of their semantic relevance to the text query. It proposes CG-DETR, which uses correlation-guided cross-attention and moment-adaptive saliency detection to achieve state-of-the-art results on various benchmarks.

Temporal Grounding is to identify specific moments or highlights from a video corresponding to textual descriptions. Typical approaches in temporal grounding treat all video clips equally during the encoding process regardless of their semantic relevance with the text query. Therefore, we propose Correlation-Guided DEtection TRansformer (CG-DETR), exploring to provide clues for query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Dummy tokens conditioned by text query take portions of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all words equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., moment and sentence level, and inferring the clip-word correlation. Lastly, we exploit the moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector. By exploiting the degrees of text engagement in each video clip, it precisely measures the highlightness of each clip. CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding. Codes are available at https://github.com/wjun0830/CGDETR.
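The dummy-token idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal pure-Python illustration under stated assumptions: video clips act as attention queries, word vectors act as keys and values, and the dummy tokens are extra keys whose share of the softmax weight is simply discarded. The function name `dummy_cross_attention`, the fixed toy vectors, and the absence of learned projections are all hypothetical simplifications; in the paper the dummy tokens are conditioned on the text query rather than fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dummy_cross_attention(clips, words, dummies):
    """Illustrative cross-attention with dummy keys (hypothetical helper).

    clips:   list of clip vectors, used as queries
    words:   list of word vectors, used as keys and values
    dummies: list of dummy vectors, used as keys only

    Returns (outputs, word_mass), where word_mass[i] is the total
    attention weight clip i assigns to real words (always below 1,
    because the dummies absorb the remainder).
    """
    dim = len(clips[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, word_mass = [], []
    for clip in clips:
        # Softmax runs over words AND dummies together.
        scores = [dot(clip, key) * scale for key in words + dummies]
        weights = softmax(scores)
        # Only the word portion of the weights is used to build the output.
        word_weights = weights[: len(words)]
        out = [
            sum(w * vec[d] for w, vec in zip(word_weights, words))
            for d in range(dim)
        ]
        outputs.append(out)
        word_mass.append(sum(word_weights))
    return outputs, word_mass


# Toy data (assumed, not from the paper): one word, one dummy token,
# one clip aligned with the word and one clip aligned with the dummy.
words = [[1.0, 0.0]]
dummies = [[0.0, 1.0]]
clips = [[4.0, 0.0], [0.0, 4.0]]

outputs, word_mass = dummy_cross_attention(clips, words, dummies)
print("attention mass on words per clip:", word_mass)
```

In this toy setup the clip aligned with the word keeps most of its attention on the text, while the clip aligned with the dummy token gives most of its weight away, so its output carries little of the text representation. That is the behaviour the abstract describes as preventing irrelevant clips from being represented by the text query.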

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
