CV · Jun 23, 2025

Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding

Yaokun Zhong, Siyu Jiang, Jian Zhu, Jian-Fang Hu

arXiv:2506.18476v1 · h-index: 3 · ICME

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of localizing multiple sentences in untrimmed videos with limited temporal annotations, offering an incremental improvement over prior methods.

The paper tackles semi-supervised video paragraph grounding by proposing a Context Consistency Learning framework that unifies consistency regularization and pseudo-labeling. The authors report that it outperforms existing methods by a large margin.

Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views' predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.
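The two ideas in the abstract, sentence removal as a strong augmentation and mutual agreement between views as label confidence, can be illustrated with a minimal sketch. The abstract does not say how agreement is measured, so the temporal IoU used here is an assumption, and the function names (`remove_sentences`, `temporal_iou`, `agreement_confidence`) are hypothetical rather than taken from the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) of a predicted moment, in seconds


def remove_sentences(sentences: Sequence[str], num_remove: int,
                     rng: random.Random) -> Tuple[List[str], List[int]]:
    """Drop up to `num_remove` random sentences from a paragraph.

    Returns the kept sentences and their indices in the original paragraph.
    At least one sentence is always kept.
    """
    num_remove = max(0, min(num_remove, len(sentences) - 1))
    removed = set(rng.sample(range(len(sentences)), num_remove))
    kept = [i for i in range(len(sentences)) if i not in removed]
    return [sentences[i] for i in kept], kept


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def agreement_confidence(full_preds: Sequence[Span],
                         aug_preds: Sequence[Span],
                         kept: Sequence[int]) -> Dict[int, float]:
    """Score each kept sentence by how well the two views agree.

    `full_preds` holds one span per sentence of the original paragraph;
    `aug_preds` holds one span per kept sentence of the augmented paragraph.
    Assumption: agreement is measured as temporal IoU.
    """
    return {idx: temporal_iou(full_preds[idx], aug_preds[j])
            for j, idx in enumerate(kept)}
```

In a pipeline of this kind, the resulting scores could weight or filter pseudo labels during retraining. How the paper actually uses the confidence is not stated in the abstract.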

