CV · Nov 18, 2021

Evaluating Transformers for Lightweight Action Recognition

arXiv:2111.09641v2 · 8 citations
AI Analysis

This work addresses the efficiency gap in action recognition for researchers with limited hardware, though it is incremental as it benchmarks existing methods rather than proposing new ones.

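The paper examines whether video transformers are practical for lightweight action recognition. It benchmarks 13 video transformers and baselines across 3 large-scale datasets and 10 hardware devices. It finds that composite transformers, which augment convolutional backbones, are the best of the transformer classes at lightweight recognition, but that current video transformers are not yet on par with traditional convolutional baselines.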

In video action recognition, transformers consistently reach state-of-the-art accuracy. However, many models are too heavyweight for the average researcher with limited hardware resources. In this work, we explore the limitations of video transformers for lightweight action recognition. We benchmark 13 video transformers and baselines across 3 large-scale datasets and 10 hardware devices. Our study is the first to evaluate the efficiency of action recognition models in depth across multiple devices and train a wide range of video transformers under the same conditions. We categorize current methods into three classes and show that composite transformers that augment convolutional backbones are best at lightweight action recognition, despite lacking accuracy. Meanwhile, attention-only models need more motion modeling capabilities and stand-alone attention block models currently incur too much latency overhead. Our experiments conclude that current video transformers are not yet capable of lightweight action recognition on par with traditional convolutional baselines, and that the previously mentioned shortcomings need to be addressed to bridge this gap. Code to reproduce our experiments will be made publicly available.

