CV · Sep 18, 2023

Selective Volume Mixup for Video Action Recognition

Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Tao Mei

arXiv:2309.09534v23.95 citations · h-index: 55 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses overfitting in video action recognition, a problem for researchers and practitioners working with limited training data. It represents an incremental improvement over existing image-based augmentation methods.

The paper tackles overfitting in video action recognition on small datasets by proposing Selective Volume Mixup (SV-Mix), a novel video augmentation strategy that mixes informative volumes from two videos, resulting in improved generalization and consistent performance boosts across benchmarks for CNN-based and transformer-based models.

Recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from overfitting on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, including Mixup, CutMix, and RandAugment, to each frame individually, although these strategies are not specifically optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes those volumes to produce a new training video. Technically, we propose two new modules, i.e., a spatial selective module that selects local patches for each spatial position, and a temporal selective module that mixes entire frames for each timestamp and maintains the spatial pattern. Each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performance of both CNN-based and transformer-based models.
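
The abstract describes two mixing modes (per spatial patch and per frame) with a random choice between them. The sketch below illustrates only that mixing structure in NumPy and is not the authors' implementation: the paper's selective modules are learnable and trained jointly with the recognizer, whereas here random masks stand in for the learned selection. The function names (`spatial_mix`, `temporal_mix`, `sv_mix_sketch`), the `patches` parameter, the 50/50 choice, and the label-mixing rule (weights proportional to the mean mask value, as in CutMix-style methods) are all assumptions made for illustration.

```python
import numpy as np


def spatial_mix(video_a, video_b, mask):
    """Mix two videos patch by patch.

    video_a, video_b: arrays of shape (T, H, W, C).
    mask: array of shape (Hp, Wp) with values in [0, 1]; H and W must be
    divisible by Hp and Wp. The same spatial mask is used for every frame.
    """
    _, h, w, _ = video_a.shape
    hp, wp = mask.shape
    # Expand the patch-level mask to pixel resolution by repetition.
    full = np.repeat(np.repeat(mask, h // hp, axis=0), w // wp, axis=1)
    full = full[None, :, :, None]  # broadcast over time and channels
    return full * video_a + (1.0 - full) * video_b


def temporal_mix(video_a, video_b, mask):
    """Mix two videos frame by frame, keeping each frame's spatial layout.

    mask: array of shape (T,) with values in [0, 1].
    """
    full = mask[:, None, None, None]  # broadcast over height, width, channels
    return full * video_a + (1.0 - full) * video_b


def sv_mix_sketch(video_a, video_b, label_a, label_b, rng, patches=(2, 2)):
    """Randomly apply spatial or temporal mixing (hypothetical stand-in).

    Random masks replace the learned selective modules (assumption).
    Labels are mixed by the mean mask value (assumption).
    """
    if rng.random() < 0.5:
        mask = rng.random(patches)
        mixed = spatial_mix(video_a, video_b, mask)
    else:
        mask = rng.random(video_a.shape[0])
        mixed = temporal_mix(video_a, video_b, mask)
    lam = float(mask.mean())
    label = lam * label_a + (1.0 - lam) * label_b
    return mixed, label, lam
```

Calling `sv_mix_sketch(a, b, label_a, label_b, np.random.default_rng(0))` on two clips of equal shape with one-hot labels returns a mixed clip of the same shape and a soft label. In the method described above, the masks would instead be produced by the selective modules and optimized together with the recognition model.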
