CV · Jul 27, 2018

W-TALC: Weakly-supervised Temporal Activity Localization and Classification

arXiv:1807.10418v3 · 338 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for more efficient video annotation in computer vision, though its contribution is incremental because it builds on existing weakly-supervised approaches.

The paper tackles the problem of reducing manual labeling effort for temporal activity localization by proposing W-TALC, a weakly-supervised framework using only video-level labels, which achieves better performance than state-of-the-art methods on Thumos14 and ActivityNet1.2 datasets.

Most activity localization methods in the literature suffer from the burden of requiring frame-wise annotation. Learning from weak labels may be a potential solution for reducing such manual labeling effort. Recent years have witnessed a substantial influx of tagged videos on the Internet, which can serve as a rich source of weakly-supervised training data. Specifically, the correlations between videos with similar tags can be utilized to temporally localize the activities. Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity Localization and Classification framework using only video-level labels. The proposed network can be divided into two sub-networks, namely the Two-Stream based feature extractor network and a weakly-supervised module, which we learn by optimizing two complementary loss functions. Qualitative and quantitative results on two challenging datasets, Thumos14 and ActivityNet1.2, demonstrate that the proposed method is able to detect activities at a fine granularity and achieve better performance than current state-of-the-art methods.
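
To make the idea of learning from video-level labels concrete, here is a minimal sketch of one common weakly-supervised recipe: pool per-snippet class activations over time with a top-k mean, train the pooled scores against the video-level label, and threshold the per-snippet activations at test time to obtain temporal segments. The abstract does not spell out its two loss functions, so this should be read as an illustrative assumption rather than the authors' exact formulation. All function names, the choice of `k`, and the threshold are hypothetical.

```python
import math

def topk_video_scores(snippet_scores, k):
    """Pool T x C snippet activations into C video-level scores.

    Assumption: each class score is the mean of its k largest activations
    over time (a multiple-instance-learning style pooling).
    """
    num_classes = len(snippet_scores[0])
    video_scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in snippet_scores), reverse=True)
        top = column[:k]
        video_scores.append(sum(top) / len(top))
    return video_scores

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def video_level_loss(snippet_scores, labels, k):
    """Cross-entropy between pooled class probabilities and the
    video-level multi-hot label, normalized to sum to one."""
    probs = softmax(topk_video_scores(snippet_scores, k))
    label_sum = sum(labels)
    return -sum((y / label_sum) * math.log(p)
                for y, p in zip(labels, probs) if y)

def localize(snippet_scores, class_index, threshold):
    """Return (start, end) snippet ranges, end exclusive, where the
    activation of the given class exceeds the (hypothetical) threshold."""
    segments, start = [], None
    for t, row in enumerate(snippet_scores):
        active = row[class_index] > threshold
        if active and start is None:
            start = t
        elif not active and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(snippet_scores)))
    return segments

# Toy example: 4 snippets, 2 classes; class 0 fires in snippets 1 and 2.
scores = [[0.0, 0.0], [5.0, 0.0], [6.0, 0.0], [0.0, 0.0]]
print(topk_video_scores(scores, k=2))
print(localize(scores, class_index=0, threshold=1.0))
```

Only the video-level label enters the loss, yet the per-snippet activations that drive the pooled score can still be read out over time, which is what makes localization possible without frame-wise annotation.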

Code Implementations: 1 repo