CV | Aug 12, 2024

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

arXiv:2408.05955v1 | 10 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key challenge in video analysis for applications like surveillance and content indexing, though it appears incremental as it builds on existing vision-language pre-training approaches.

The paper tackles the task discrepancy problem in weakly supervised temporal action localization by aligning human action and vision-language pre-training knowledge in a probabilistic embedding space, achieving significant performance improvements over previous state-of-the-art methods.

Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
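To make the idea of a probabilistic embedding space compared by "statistical similarities" concrete, here is a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes each embedding is a diagonal Gaussian given as a (mean, variance) pair, uses symmetric KL divergence as the statistical distance, and plugs it into an InfoNCE-style contrastive loss. The abstract does not specify the similarity measure or the loss form, so both are assumptions, and all function names are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(P || Q) for two diagonal Gaussians.

    Each argument is a list of per-dimension means or variances.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(p, q):
    """Symmetrised KL between two (mean, variance) pairs (assumed distance)."""
    return 0.5 * (kl_diag_gaussians(*p, *q) + kl_diag_gaussians(*q, *p))


def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss where similarity = negative symmetric KL.

    The loss is small when the anchor distribution is statistically
    closer to the positive than to every negative.
    """
    logits = [-symmetric_kl(anchor, positive) / temperature]
    logits += [-symmetric_kl(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: (mean, variance) per distribution.
    snippet = ([0.0, 0.0], [1.0, 1.0])
    matching_text = ([0.1, 0.0], [1.0, 1.0])
    other_text = ([3.0, 3.0], [1.0, 1.0])
    print(contrastive_loss(snippet, matching_text, [other_text]))
```

Representing each embedding as a distribution rather than a single point lets the variance express uncertainty, which is one plausible reading of why the abstract contrasts the approach with deterministic representations.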

Code Implementations: 1 repo
