CV | Sep 23, 2021

Self-supervised Learning for Semi-supervised Temporal Language Grounding

arXiv:2109.11475v2 | 15 citations
Originality: Incremental advance
AI Analysis

This work addresses the high cost of manual temporal annotation in video understanding, a concern for both researchers and practitioners. The advance is incremental, since it builds on existing semi-supervised and self-supervised techniques.

The paper tackles the problem of Temporal Language Grounding (TLG) with limited annotations by proposing a semi-supervised method that incorporates self-supervised learning, achieving competitive performance compared to fully-supervised state-of-the-art methods while using only a small portion of temporal annotations.

Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires comprehensive understanding of both sentence semantics and video contents. Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance. Since manual annotations are expensive, to cope with limited annotations, we tackle TLG in a semi-supervised way by incorporating self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal Language Grounding (S^4TLG). S^4TLG consists of two parts: (1) A pseudo label generation module that adaptively produces instant pseudo labels for unlabeled samples based on predictions from a teacher model; (2) A self-supervised feature learning module with inter-modal and intra-modal contrastive losses to learn video feature representations under the constraints of video content consistency and video-text alignment. We conduct extensive experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets. The results demonstrate that our proposed S^4TLG can achieve competitive performance compared to fully-supervised state-of-the-art methods while only requiring a small portion of temporal annotations.
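The abstract names inter-modal and intra-modal contrastive losses but does not give their form. As a rough illustration only, the sketch below uses the widely used InfoNCE formulation; it is not the paper's implementation. The function names (`info_nce`, `contrastive_objective`), the temperature, the `intra_weight` parameter, and the use of an augmented view of the same video for the intra-modal term are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss where candidates[i] is the positive for anchors[i].

    All other candidates in the batch act as negatives.
    """
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature
                  for c in candidates]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def contrastive_objective(video, text, video_aug,
                          temperature=0.1, intra_weight=1.0):
    """Hypothetical combination of the two terms (assumed, not from the paper).

    inter: symmetric video-text alignment loss.
    intra: consistency between a video feature and an augmented view of it.
    """
    inter = 0.5 * (info_nce(video, text, temperature)
                   + info_nce(text, video, temperature))
    intra = info_nce(video, video_aug, temperature)
    return inter + intra_weight * intra


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    text = [[2.0, 0.0], [0.0, 3.0]]
    print("aligned pairs:", info_nce(video, text))
    print("swapped pairs:", info_nce(video, text[::-1]))
```

With correctly paired features the loss is close to zero, and it grows when the pairs are swapped, which is the behaviour a video-text alignment constraint relies on.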

