CV · Apr 29, 2021

Actor-centered Representations for Action Localization in Streaming Videos

arXiv:2104.14131v2 · 4 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of scaling event perception to real-world applications by localizing actions without supervision, though its methodological contribution is incremental.

The paper localizes actions in streaming videos by learning actor-centered representations through hierarchical predictive learning, and reports performance competitive with fully supervised approaches after only one epoch of training.

Event perception tasks such as recognizing and localizing actions in streaming videos are essential for scaling to real-world application contexts. We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning to localize actions in streaming videos without the need for training labels and outlines for the objects in the video. We propose a framework driven by the notion of hierarchical predictive learning to construct actor-centered features by attention-based contextualization. The key idea is that predictable features or objects do not attract attention and hence do not contribute to the action of interest. Experiments on three benchmark datasets show that the approach can learn robust representations for localizing actions using only one epoch of training, i.e., a single pass through the streaming video. We show that the proposed approach outperforms unsupervised and weakly supervised baselines while offering competitive performance to fully supervised approaches. Additionally, we extend the model to multi-actor settings to recognize group activities while localizing the multiple, plausible actors. We also show that it generalizes to out-of-domain data with limited performance degradation.
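The key idea in the abstract, that predictable features do not attract attention, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the grid-of-feature-vectors layout, the squared-error measure, and the softmax with a `temperature` parameter are all assumptions chosen for illustration. The sketch turns per-cell prediction error into an attention map and picks the least predictable cell as a plausible actor location.

```python
import math


def prediction_error_attention(predicted, observed, temperature=1.0):
    """Softmax attention over grid cells, driven by per-cell prediction error.

    `predicted` and `observed` are grids (lists of rows) of feature vectors.
    Hypothetical helper for illustration only, not the paper's method.
    """
    errors = []
    for pred_row, obs_row in zip(predicted, observed):
        row = []
        for p, o in zip(pred_row, obs_row):
            # Squared L2 distance between predicted and observed features.
            row.append(sum((a - b) ** 2 for a, b in zip(p, o)))
        errors.append(row)

    # Numerically stable softmax over all cells: larger error, more attention.
    flat = [e / temperature for row in errors for e in row]
    peak = max(flat)
    exps = [math.exp(e - peak) for e in flat]
    total = sum(exps)
    width = len(errors[0])
    return [
        [exps[r * width + c] / total for c in range(width)]
        for r in range(len(errors))
    ]


def most_surprising_cell(attention):
    """Return (row, col) of the cell with the highest attention weight."""
    cells = [
        (r, c)
        for r in range(len(attention))
        for c in range(len(attention[0]))
    ]
    return max(cells, key=lambda rc: attention[rc[0]][rc[1]])


# Toy 2x2 grid: every cell is predicted perfectly except (1, 0).
predicted = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
observed = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 4.0], [3.0, 3.0]]]
attention = prediction_error_attention(predicted, observed)
print(most_surprising_cell(attention))
```

In this toy grid, the perfectly predicted cells receive equal, low weights and the mispredicted cell receives the highest weight, which mirrors the claim that predictable regions do not contribute to the action of interest.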

