CV · Dec 20, 2022

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

arXiv:2212.10596v2 · 8 citations · h-index: 26
AI Analysis

The paper addresses the restriction of video action detection to a small, fixed set of action classes. It offers researchers and practitioners a simple method with incremental improvements.

The paper tackles open-vocabulary temporal action detection in videos by using pretrained image-text co-embeddings, achieving performance competitive with fully-supervised models, and further improves it by ensembling with motion or audio features.

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
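
The general idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that per-frame image embeddings and class-name text embeddings from some pretrained image-text model are already available as arrays. All function names, the softmax temperature, the weighted-average ensembling, and the thresholding step are hypothetical choices made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_class_scores(frame_emb, text_emb, temperature=0.01):
    # frame_emb: (num_frames, dim) image embeddings, one per frame.
    # text_emb:  (num_classes, dim) text embeddings, one per class name.
    # Returns (num_frames, num_classes) class probabilities per frame.
    sims = l2_normalize(frame_emb) @ l2_normalize(text_emb).T
    return softmax(sims / temperature, axis=-1)

def ensemble(scores_a, scores_b, weight=0.5):
    # Assumed late fusion: a weighted average of per-frame scores from two
    # modalities (for example image-text scores and motion or audio scores).
    return weight * scores_a + (1.0 - weight) * scores_b

def segments_for_class(scores, class_idx, threshold=0.5):
    # Group consecutive frames whose score exceeds the threshold into
    # half-open (start_frame, end_frame) segments.
    active = scores[:, class_idx] > threshold
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```

Because the class set enters only through `text_emb`, adding a new action class in this sketch means embedding one more class name, with no retraining of the scoring step.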
