CV | Jun 20, 2023

How can objects help action recognition?

arXiv:2306.11726v1 | 29 citations | h-index: 151
Originality: Incremental advance
AI Analysis

This work addresses the computational inefficiency of video models for action recognition, offering a method that reduces the number of tokens processed without sacrificing accuracy. The advance is incremental but practical for real-world applications.

The paper tackles inefficient video action recognition with two components: an object-guided token sampling strategy, which lets the model process fewer tokens, and an object-aware attention module, which maintains or improves accuracy. The model matches its baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-Something v2, and Epic-Kitchens, respectively, and gains 0.6 to 4.2 points when it processes the same number of tokens as the baseline.

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects or their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works, which either drop tokens at the cost of accuracy or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. Second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-Something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.
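
To make the idea of object-guided token sampling concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify how tokens are scored or selected, so everything below is an assumption for illustration. The function names `foreground_mask` and `sample_tokens`, the normalized box format, the rule of keeping tokens whose patch centre falls inside a detected box, and the random fill from background tokens are all hypothetical.

```python
import numpy as np


def foreground_mask(boxes_per_frame, grid_h, grid_w):
    """Mark tokens whose patch centre lies inside any object box.

    boxes_per_frame: one list per frame of (x0, y0, x1, y1) boxes in
    normalized [0, 1] image coordinates (assumed format).
    Returns a boolean array of shape (T, grid_h, grid_w).
    """
    ys = (np.arange(grid_h) + 0.5) / grid_h  # patch-centre rows
    xs = (np.arange(grid_w) + 0.5) / grid_w  # patch-centre columns
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    mask = np.zeros((len(boxes_per_frame), grid_h, grid_w), dtype=bool)
    for t, boxes in enumerate(boxes_per_frame):
        for x0, y0, x1, y1 in boxes:
            mask[t] |= (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
    return mask


def sample_tokens(mask, keep_ratio, seed=0):
    """Return sorted flat indices of the tokens to keep.

    Assumed policy: object tokens are kept first; any remaining budget
    is filled with randomly chosen background tokens. If the object
    tokens alone exceed the budget, they are randomly subsampled.
    """
    rng = np.random.default_rng(seed)
    flat = mask.reshape(-1)
    budget = int(np.ceil(keep_ratio * flat.size))
    fg = np.flatnonzero(flat)
    bg = np.flatnonzero(~flat)
    if fg.size >= budget:
        keep = rng.choice(fg, size=budget, replace=False)
    else:
        extra = rng.choice(bg, size=budget - fg.size, replace=False)
        keep = np.concatenate([fg, extra])
    return np.sort(keep)


# Two frames on a 4x4 token grid, one box per frame (toy values).
boxes = [[(0.0, 0.0, 0.5, 0.5)], [(0.5, 0.5, 1.0, 1.0)]]
mask = foreground_mask(boxes, grid_h=4, grid_w=4)
keep = sample_tokens(mask, keep_ratio=0.4)
```

In this toy setup, `keep` indexes the flattened spatio-temporal token sequence, so a downstream transformer would only receive the selected tokens. How the paper actually weights foreground against background, and how it obtains the boxes, should be taken from the paper itself rather than from this sketch.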

Code Implementations: 1 repo