CV · AI · Apr 10, 2023

Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection

Microsoft, NVIDIA
arXiv:2304.04688v4 · 5 citations · h-index: 33 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a problem relevant to computer vision researchers: detecting actions in videos without labeled training data. The advance is incremental, as it builds on existing visual-language models.

The paper tackles zero-shot spatio-temporal action detection by using a pre-trained visual-language model with interaction modules and prompting to align features, and reports excellent accuracy on the J-HMDB and UCF101-24 datasets.

The goal of spatio-temporal action detection is to determine the time and place where each person's action occurs in a video and to classify the corresponding action category. Most existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-language model to extract representative image and text features, and to model the relationship between these features through different interaction modules to obtain the interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we calculate the similarity between the interaction feature and the text feature for each label to determine the action category. Our experiments on the J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction module and prompting make the visual-language features better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be available at https://github.com/webber2933/iCLIP.
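
The final classification step described in the abstract (compare the interaction feature with a prompted text feature for each label, then pick the most similar label) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional vectors, and the `alpha` parameter are all hypothetical, and the learned prompting module is replaced here by a simple additive shift of the text feature toward the interaction feature. Feature vectors are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def prompt_text_feature(text_feat, interaction_feat, alpha=0.1):
    # ASSUMPTION: stand-in for the paper's learned prompting.
    # Here the label's text feature is simply shifted toward the
    # interaction feature by a small hypothetical factor `alpha`.
    return [t + alpha * i for t, i in zip(text_feat, interaction_feat)]


def classify_action(interaction_feat, label_text_feats, alpha=0.1):
    """Return (best_label, scores) by similarity to each prompted label feature."""
    scores = {}
    for label, text_feat in label_text_feats.items():
        prompted = prompt_text_feature(text_feat, interaction_feat, alpha)
        scores[label] = cosine_similarity(interaction_feat, prompted)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy data (hypothetical): in practice these would come from a
# pre-trained visual-language model and the interaction modules.
label_text_feats = {
    "run": [1.0, 0.0, 0.0],
    "jump": [0.0, 1.0, 0.0],
    "wave": [0.0, 0.0, 1.0],
}
interaction_feat = [0.2, 0.9, 0.1]

predicted, scores = classify_action(interaction_feat, label_text_feats)
print(predicted)
```

Because the labels are only compared through their text features, an unseen action category can be scored by adding its text feature to `label_text_feats`, which is the property that makes this style of classification usable in a zero-shot setting.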

Code Implementations: 1 repo
