CV | Jan 23, 2025

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

arXiv:2501.13795v13 citations | h-index: 4
Originality: Highly original
AI Analysis

This work addresses the domain shift and high computational cost that limit real-world video analysis, offering a more practical way to detect unseen activities.

The paper tackles zero-shot temporal action detection in videos with a training-free method that uses vision-language models to classify and localize unseen activities without fine-tuning. On the THUMOS14 and ActivityNet-1.3 datasets it outperforms state-of-the-art unsupervised methods while requiring only 1/13 of their runtime.

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
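
The abstract names LogOIC but does not give its formula. As a rough illustration only, the sketch below computes a plain outer-inner contrast for a candidate segment: mean actionness inside the segment minus mean actionness in a margin just outside it. The function name `oic_score`, the `inflate` ratio, and the optional `log_decay` weighting (which down-weights outer frames farther from the boundary by `1 / log2(1 + distance)`) are all assumptions made for this example and should not be read as the paper's definition.

```python
import math


def oic_score(actionness, start, end, inflate=0.25, log_decay=False):
    """Outer-inner contrast of the proposal [start, end) over per-frame scores.

    Hypothetical sketch: `inflate` and `log_decay` are illustrative
    assumptions, not parameters taken from the paper.
    """
    n = len(actionness)
    if not (0 <= start < end <= n):
        raise ValueError("need 0 <= start < end <= len(actionness)")

    # Mean actionness inside the proposal.
    inner = actionness[start:end]
    inner_mean = sum(inner) / len(inner)

    # Outer margin on each side, proportional to the proposal length.
    margin = max(1, math.ceil(inflate * (end - start)))

    # Collect (distance to nearest proposal boundary, score) for outer frames.
    outer = [(start - i, actionness[i])
             for i in range(max(0, start - margin), start)]
    outer += [(i - end + 1, actionness[i])
              for i in range(end, min(n, end + margin))]
    if not outer:
        # The proposal covers the whole sequence, so there is no contrast term.
        return inner_mean

    if log_decay:
        # Assumed weighting: frames adjacent to the boundary get weight 1,
        # frames farther away get logarithmically smaller weights.
        weights = [1.0 / math.log2(1 + d) for d, _ in outer]
    else:
        weights = [1.0] * len(outer)

    outer_mean = sum(w * a for w, (_, a) in zip(weights, outer)) / sum(weights)
    return inner_mean - outer_mean


# A proposal aligned with the high-actionness frames scores higher
# than one shifted by a single frame.
scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
tight = oic_score(scores, 2, 6)
shifted = oic_score(scores, 1, 5)
```

A score of this kind rewards segments whose boundaries coincide with sharp changes in actionness, which is why it can rank candidate segments without any trained localization head.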

Code Implementations: 1 repo
