Social Adaptive Module for Weakly-supervised Group Activity Recognition
The paper addresses the problem of reducing annotation costs for group activity recognition in videos. The contribution is incremental, as it adapts an existing task to weaker supervision.
The paper tackles weakly-supervised group activity recognition using only video-level labels, proposing a social adaptive module to identify key persons and frames, achieving comparable accuracy to strongly-supervised methods on NBA and volleyball datasets.
This paper presents a new task, weakly-supervised group activity recognition (GAR), which differs from conventional GAR in that only video-level labels are available and the important persons within each frame are not provided, even in the training data. This makes it easier to collect and annotate a large-scale NBA dataset, which in turn raises new challenges for GAR. To mine useful information from weak supervision, we build on the key insight that key instances are likely to be related to each other, and design a social adaptive module (SAM) to reason about key persons and frames from noisy data. Experiments show significant improvement on the NBA dataset as well as the popular volleyball dataset. In particular, our model trained on video-level annotation achieves accuracy comparable to prior algorithms that required strong labels.
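The insight that key instances tend to be related to each other can be illustrated with a toy selection rule. The sketch below is not the paper's SAM; it is a minimal illustration under stated assumptions: each person or frame is a plain feature vector, relatedness is cosine similarity, and the key instances are the `k` with the highest total similarity to the others. The names `cosine` and `select_key_instances` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_key_instances(features, k):
    """Return the indices (ascending) of the k instances most related to the rest.

    Assumption for illustration only: relatedness of instance i is the sum of
    its cosine similarities to every other instance.
    """
    n = len(features)
    scores = [
        sum(cosine(features[i], features[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: three mutually similar instances and one unrelated outlier.
features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.1],
    [-1.0, 0.0],  # outlier, e.g. a bystander or an irrelevant frame
]
key = select_key_instances(features, k=3)
print(key)
```

In this toy case the outlier is dropped because it is dissimilar to every other instance, which is the kind of noise filtering the abstract motivates. A learned module would use trained features and relations rather than fixed cosine similarity.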