Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
This work addresses the cost of annotating temporal boundaries when localizing video segments from language descriptions. It represents an incremental improvement over existing weakly supervised methods.
The paper tackles weakly supervised temporal language grounding, where only video-level descriptions are available for training. It proposes a candidate-free framework, Fine-grained Semantic Alignment Network (FSAN), that learns token-by-clip cross-modal alignment and reports state-of-the-art performance on the ActivityNet-Captions and DiDeMo benchmarks.
Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manually annotating temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a set of candidate segments and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video and the complicated semantics of the sentence are lost during learning. In this work, we propose a novel candidate-free framework for weakly supervised TLG: Fine-grained Semantic Alignment Network (FSAN). Instead of viewing the sentence and the candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment through an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely used benchmarks, ActivityNet-Captions and DiDeMo, where FSAN achieves state-of-the-art performance.
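To make the idea of a token-by-clip alignment map concrete, the following is a minimal illustrative sketch, not FSAN itself. It assumes that token and clip features are already available as plain vectors, substitutes cosine similarity for the learned iterative cross-modal interaction module, and grounds by taking the longest run of clips whose token-averaged score passes a threshold. The function names `alignment_map` and `ground`, the threshold value, and the toy features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_map(token_feats, clip_feats):
    """Return a tokens x clips matrix of similarity scores.

    Assumption: cosine similarity stands in for a learned alignment.
    """
    return [[cosine(t, c) for c in clip_feats] for t in token_feats]


def ground(sim_map, threshold=0.5):
    """Average the map over tokens, then return the longest contiguous
    run of clips scoring >= threshold as (start, end), end inclusive.
    Returns None when no clip passes the threshold.
    """
    num_tokens = len(sim_map)
    num_clips = len(sim_map[0])
    clip_scores = [
        sum(row[j] for row in sim_map) / num_tokens for j in range(num_clips)
    ]
    best, start = None, None
    # A sentinel score below any threshold closes a run that reaches the end.
    for j, score in enumerate(clip_scores + [float("-inf")]):
        if score >= threshold and start is None:
            start = j
        elif score < threshold and start is not None:
            if best is None or (j - start) > (best[1] - best[0] + 1):
                best = (start, j - 1)
            start = None
    return best


# Toy example: 2 tokens, 4 clips, 2-dimensional features (made-up values).
tokens = [[1.0, 0.0], [1.0, 1.0]]
clips = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
print(ground(alignment_map(tokens, clips)))  # -> (1, 2)
```

Because the span is read directly off the map, no candidate segment set is enumerated, which is the property the candidate-free framing refers to. How FSAN actually computes and decodes its map is not specified in this abstract.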