CV May 29, 2025

CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Rui Xia, Dan Jiang, Quan Zhang, Ke Zhang, Chun Yuan

arXiv:2505.23524v26.21 citations h-index: 3 ICIP

Originality: Incremental advance

AI Analysis

This work addresses unsupervised temporal action localization for video analysis, representing an incremental improvement over existing methods.

The paper tackles unsupervised temporal action localization by addressing two key challenges: classification pre-trained features focusing too much on discriminative regions and visual-only methods struggling with contextual boundaries. The proposed CLIP-assisted cross-view audiovisual enhancement method achieves state-of-the-art performance on two public datasets.

Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
