CV · Feb 18, 2025

Contrast-Unity for Partially-Supervised Temporal Sentence Grounding

arXiv:2502.12917v1 · 5 citations · h-index: 18 · ICASSP
Originality: Incremental advance
AI Analysis

The paper addresses the cost of annotation in video understanding, a concern for both researchers and practitioners. The advance is incremental, since it builds on existing fully-supervised and weakly-supervised methods.

The paper tackles the problem of temporal sentence grounding in videos by introducing a partially-supervised setting that uses only short-clip annotations during training to reduce costs, achieving superior performance on Charades-STA and ActivityNet Captions datasets.

Temporal sentence grounding aims to detect the timestamps of an event described by a natural language query in a given untrimmed video. The existing fully-supervised setting achieves strong results but incurs high annotation costs, while the weakly-supervised setting uses cheap labels but performs poorly. To pursue high performance at lower annotation cost, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip annotations are available during training. To make full use of the partial labels, we design a contrast-unity framework with a two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using quadruple contrastive learning: event-query gathering, event-background separation, intra-cluster compactness, and inter-cluster separability. The resulting high-quality representations then yield acceptable grounding pseudo-labels. In the explicit stage, to optimize the grounding objective directly, we train a fully-supervised model on the obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thorough ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision, as well as our superior performance.
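To make the implicit stage more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss, in which a query embedding is pulled toward the feature of its annotated short clip and pushed away from background clip features. It only illustrates the general idea behind the event-query and event-background terms; it is not the paper's formulation. The function names (`l2_normalize`, `info_nce`), the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor (hypothetical helper).

    anchor:    (d,)   e.g. a sentence-query embedding
    positive:  (d,)   e.g. the feature of the annotated short clip
    negatives: (n, d) e.g. features of background clips
    """
    anchor = l2_normalize(anchor)
    positive = l2_normalize(positive)
    negatives = l2_normalize(negatives)

    # Cosine similarities scaled by the temperature.
    pos_logit = anchor @ positive / temperature      # scalar
    neg_logits = negatives @ anchor / temperature    # shape (n,)

    logits = np.concatenate([[pos_logit], neg_logits])
    logits = logits - logits.max()                   # numerical stability

    # Negative log-probability assigned to the positive pair.
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)


# Toy example with made-up 2-D features (assumed values, not real data).
query = np.array([1.0, 0.0])
event_clip = np.array([0.9, 0.1])
background_clips = np.array([[0.0, 1.0], [-1.0, 0.2]])

loss = info_nce(query, event_clip, background_clips)
```

A lower loss means the query sits closer to the event clip than to the background clips. The intra-cluster and inter-cluster terms named in the abstract would need additional grouping information that this sketch does not model.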

