CV · Mar 30, 2021

Learning Representational Invariances for Data-Efficient Action Recognition

Yuliang Zou, Jinwoo Choi, Qitong Wang, Jia-Bin Huang

arXiv:2103.16565v312.648 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of limited labeled data for video action recognition, which matters for applications such as surveillance and human-computer interaction. The advance is incremental: it builds on existing semi-supervised learning frameworks and contributes new augmentation methods.

The paper tackles data-efficient action recognition in videos by investigating data augmentation strategies that capture different video invariances: photometric, geometric, temporal, and actor/scene. It reports improved performance on datasets such as Kinetics-100/400 and UCF-101, in both the low-label and the fully supervised settings.
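As a rough illustration of three of these augmentation families, the sketch below applies one simple transform per family to a toy clip. It is not the authors' implementation: the function names, the brightness range `max_delta`, and the choice of transforms (brightness shift, horizontal flip, frame reversal) are assumptions made for this example. Actor/scene augmentation is left out because it does not reduce to a few array operations.

```python
import numpy as np

def photometric(video, rng, max_delta=0.2):
    """Shift brightness by one random offset shared by all frames.

    max_delta is a hypothetical value chosen for this example.
    """
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(video + delta, 0.0, 1.0)

def geometric(video):
    """Flip every frame horizontally (width is axis 2)."""
    return video[:, :, ::-1, :]

def temporal(video):
    """Reverse the frame order (time is axis 0)."""
    return video[::-1]

rng = np.random.default_rng(0)
# Toy clip with layout (frames, height, width, channels), values in [0, 1].
clip = rng.random((8, 32, 32, 3))

views = {
    "photometric": photometric(clip, rng),
    "geometric": geometric(clip),
    "temporal": temporal(clip),
}
```

Each transform is applied to the whole clip at once, so the change is consistent across frames. Whether a given transform preserves the label depends on the dataset; reversing time, for example, would not be safe for actions defined by their direction.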

Data augmentation is a ubiquitous technique for improving image classification when labeled data is scarce. Constraining the model predictions to be invariant to diverse data augmentations effectively injects the desired representational invariances to the model (e.g., invariance to photometric variations) and helps improve accuracy. Compared to image data, the appearance variations in videos are far more complex due to the additional temporal dimension. Yet, data augmentation methods for videos remain under-explored. This paper investigates various data augmentation strategies that capture different video invariances, including photometric, geometric, temporal, and actor/scene augmentations. When integrated with existing semi-supervised learning frameworks, we show that our data augmentation strategy leads to promising performance on the Kinetics-100/400, Mini-Something-v2, UCF-101, and HMDB-51 datasets in the low-label regime. We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.
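The idea of constraining predictions to be invariant to augmentation can be sketched as a consistency loss on unlabeled data. The example below uses a pseudo-label formulation in the style of FixMatch, picked here only as one example of a semi-supervised framework; the abstract does not say which frameworks were used. The function names and the `threshold` value are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy between the pseudo-label taken from a weakly augmented
    view and the prediction on a strongly augmented view of the same clip.

    Returns 0.0 when the weak-view prediction is below the confidence
    threshold (a hypothetical value chosen for this sketch).
    """
    weak_probs = softmax(weak_logits)
    confidence = max(weak_probs)
    if confidence < threshold:
        return 0.0
    pseudo_label = weak_probs.index(confidence)
    strong_probs = softmax(strong_logits)
    return -math.log(strong_probs[pseudo_label])

# Confident weak prediction that the strong view disagrees with: positive loss.
confident = consistency_loss([8.0, 0.0, 0.0], [1.0, 2.0, 0.0])
# Unconfident weak prediction: the example contributes no loss.
skipped = consistency_loss([1.0, 0.9, 0.8], [1.0, 2.0, 0.0])
```

In a training loop the two sets of logits would come from the same model applied to two augmented views of one unlabeled video, and this term would be added to the usual supervised loss on the labeled examples.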

