CV | Dec 20, 2022

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

arXiv:2212.10596v27.38 citations | h-index: 26

Originality: Incremental advance

AI Analysis

The paper addresses a limitation that matters to both researchers and practitioners: video action detectors are restricted to a small, fixed set of action classes. It offers a simple method that yields incremental improvements.

The paper tackles open-vocabulary temporal action detection in videos by using pretrained image-text co-embeddings, achieving performance competitive with fully-supervised models, and further improves it by ensembling with motion or audio features.

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
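
The general idea in the abstract, scoring video content against free-form class names in a shared image-text embedding space and then ensembling with another modality, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy 3-dimensional embeddings, the softmax temperature, the weighted-average ensembling rule, and all function names below are assumptions chosen for illustration only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(frame_emb, text_embs):
    """Cosine similarity between one frame embedding and each class-name embedding."""
    f = l2_normalize(frame_emb)
    return {
        label: sum(a * b for a, b in zip(f, l2_normalize(t)))
        for label, t in text_embs.items()
    }

def softmax(scores, temperature=0.1):
    """Turn similarity scores into class probabilities (temperature is an assumed value)."""
    top = max(scores.values())
    exps = {k: math.exp((s - top) / temperature) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def ensemble(probs_a, probs_b, weight=0.5):
    """Weighted average of two per-class distributions (an assumed ensembling rule)."""
    return {k: weight * probs_a[k] + (1 - weight) * probs_b[k] for k in probs_a}

# Made-up embeddings standing in for the outputs of a pretrained image-text model.
text_embs = {"surfing": [1.0, 0.0, 0.0], "cooking": [0.0, 1.0, 0.0]}
frame_emb = [0.9, 0.1, 0.0]

image_text_probs = softmax(cosine_scores(frame_emb, text_embs))
# Hypothetical probabilities from a second modality, e.g. motion or audio features.
motion_probs = {"surfing": 0.6, "cooking": 0.4}

combined = ensemble(image_text_probs, motion_probs)
best_label = max(combined, key=combined.get)
```

Because the class names enter only through their text embeddings, a new class can be added by embedding its name, with no retraining of the classifier in this sketch.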
