CV · Dec 6, 2018

Video Action Transformer Network

arXiv:1812.02707v2 · 774 citations
AI Analysis

This work addresses action recognition in videos, which matters for applications such as surveillance and human-computer interaction. Its contribution is incremental: it adapts Transformer architectures to this domain.

The paper tackles the problem of recognizing and localizing human actions in video clips by introducing the Action Transformer model, which outperforms state-of-the-art methods on the Atomic Visual Actions dataset using only raw RGB frames.

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
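To make the idea of a person-specific query aggregating surrounding context more concrete, here is a minimal sketch of single-query scaled dot-product attention, the core operation of Transformer-style architectures. It is an illustration under assumptions, not the paper's implementation: the names `attend`, `person_query`, `context_keys`, and `context_values` are hypothetical, the vectors are toy values, and learned projections, multiple heads, residual connections, and the video feature extractor are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    query:  feature vector for one person (hypothetical stand-in for a
            person-specific query)
    keys:   one vector per spatiotemporal context location
    values: one vector per context location, aggregated into the output
    Returns the attention-weighted sum of values and the weights.
    """
    d = len(query)
    # Similarity of the query to each context location, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim_v = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return out, weights


# Toy data (assumed, not from the paper): three context locations.
person_query = [1.0, 0.0]
context_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
# One-hot values, so the output equals the attention weights themselves.
context_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

updated, weights = attend(person_query, context_keys, context_values)
```

In this toy setup the first context location is most aligned with the query, so it receives the largest weight. Attention concentrating on a few relevant locations in this way is the mechanism the abstract credits with emphasizing hands and faces, although here the weights come from hand-picked vectors rather than learned features.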

