CV · Dec 11, 2019

Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition

Jinwoo Choi, Chen Gao, Joseph C. E. Messou, Jia-Bin Huang

arXiv:1912.05534v125.3210 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the poor generalization of video action recognition models that results from scene bias. The contribution is incremental, as it builds on existing debiasing techniques.

The paper tackles scene bias in video action recognition, where models rely on the surrounding scene rather than on the discriminative cues of the action itself. It augments the standard classification objective with an adversarial loss on scene types and a human mask confusion loss, and reports consistent improvements over a baseline without debiasing on action classification, temporal localization, and spatio-temporal action detection.

Human activities often occur in specific scene contexts, e.g., playing basketball on a basketball court. Training a model using existing video datasets thus inevitably captures and leverages such bias (instead of using the actual discriminative cues). The learned representation may not generalize well to new action classes or different tasks. In this paper, we propose to mitigate scene bias for video representation learning. Specifically, we augment the standard cross-entropy loss for action classification with 1) an adversarial loss for scene types and 2) a human mask confusion loss for videos where the human actors are masked out. These two losses encourage learning representations that are unable to predict the scene types and the correct actions when there is no evidence. We validate the effectiveness of our method by transferring our pre-trained model to three different tasks, including action classification, temporal localization, and spatio-temporal action detection. Our results show consistent improvement over the baseline model without debiasing.
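The way the three loss terms in the abstract could combine can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that the adversarial term enters the feature extractor's objective as a negated scene cross-entropy and that "confusion" on human-masked videos is encouraged by maximizing the entropy of the action prediction. The function names and the weights `lambda_adv` and `lambda_conf` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of a single example against an integer class label."""
    return -math.log(softmax(logits)[label])


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def debiased_feature_loss(action_logits, action_label,
                          scene_logits, scene_label,
                          masked_action_logits,
                          lambda_adv=0.5, lambda_conf=0.5):
    """Objective seen by the feature extractor (assumed form).

    - action term: standard cross-entropy on the original video
    - adversarial term: the scene cross-entropy is *subtracted*, so
      features that make scenes easy to predict are penalized
    - confusion term: the entropy on the human-masked video is
      *subtracted*, so confident action predictions without a visible
      actor are penalized
    """
    l_action = cross_entropy(action_logits, action_label)
    l_scene = cross_entropy(scene_logits, scene_label)
    h_masked = entropy(masked_action_logits)
    return l_action - lambda_adv * l_scene - lambda_conf * h_masked


# Same action and scene predictions; only the masked-video prediction differs.
confident = debiased_feature_loss([2.0, 0.0], 0, [0.0, 0.0], 0, [5.0, 0.0])
confused = debiased_feature_loss([2.0, 0.0], 0, [0.0, 0.0], 0, [0.0, 0.0])
print(confident > confused)
```

In a real training loop the scene classifier itself would still be trained to minimize its own cross-entropy, with the sign flipped only for the shared features (for example through a gradient reversal layer); that detail is likewise an assumption here rather than something stated in the abstract.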
