CV · Oct 4, 2023

Reinforcement Learning-based Mixture of Vision Transformers for Video Violence Recognition

Hamid Mohammadi, Ehsan Nazerfard, Tahereh Firoozi

arXiv:2310.03108v1 · 2.84 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This paper addresses accurate and scalable violence recognition in videos for surveillance or content moderation. It represents an incremental improvement over existing methods.

The paper tackles video violence recognition by proposing a novel transformer-based Mixture of Experts (MoE) system that combines large and efficient vision transformers with a reinforcement learning-based router, achieving 92.4% accuracy on the RWF dataset.

Video violence recognition based on deep learning concerns accurate yet scalable human violence recognition. Currently, most state-of-the-art video violence recognition studies use CNN-based models to represent and categorize videos. However, recent studies suggest that pre-trained transformers are more accurate than CNN-based models on various video analysis benchmarks. Yet these models are not thoroughly evaluated for video violence recognition. This paper introduces a novel transformer-based Mixture of Experts (MoE) video violence recognition system. Through an intelligent combination of large vision transformers and efficient transformer architectures, the proposed system not only takes advantage of the vision transformer architecture but also reduces the cost of utilizing large vision transformers. The proposed architecture maximizes violence recognition system accuracy while actively reducing computational costs through a reinforcement learning-based router. The empirical results show the proposed MoE architecture's superiority over CNN-based models by achieving 92.4% accuracy on the RWF dataset.
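The routing idea in the abstract can be illustrated with a toy example. The abstract does not specify the router's inputs, reward, or training algorithm, so everything below is an assumption made for illustration: a two-expert setup, a logistic (Bernoulli) routing policy over a hypothetical scalar "difficulty" feature, a reward equal to correctness minus a cost penalty, and a plain REINFORCE update. The experts are stubs, not vision transformers, and the cost values are invented.

```python
import math
import random

# Hypothetical relative compute costs; not taken from the paper.
EXPERT_COST = {0: 0.1, 1: 1.0}  # 0 = efficient expert, 1 = large expert


def route_probability(weights, features):
    """Probability of sending a clip to the large expert (logistic policy)."""
    logit = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


def reward(correct, expert, cost_weight=0.3):
    """Accuracy reward minus a penalty proportional to the expert's cost."""
    return (1.0 if correct else 0.0) - cost_weight * EXPERT_COST[expert]


def reinforce_step(weights, features, expert, r, baseline, lr=0.1):
    """One REINFORCE update: w += lr * (r - baseline) * grad log pi(expert)."""
    p = route_probability(weights, features)
    # For a Bernoulli policy, d/d(logit) log pi(a) = a - p.
    grad_scale = (expert - p) * (r - baseline)
    return [w + lr * grad_scale * x for w, x in zip(weights, features)]


def simulate_expert(expert, difficulty, rng):
    """Stub experts: the small one struggles on hard clips, the large one rarely fails."""
    if expert == 1:
        return rng.random() < 0.95
    return rng.random() < (0.95 if difficulty < 0.5 else 0.4)


def train_router(steps=5000, seed=0):
    """Train the toy router on simulated clips and return its weights."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]  # bias term and difficulty coefficient
    baseline = 0.0
    for _ in range(steps):
        difficulty = rng.random()
        features = [1.0, difficulty]
        p = route_probability(weights, features)
        expert = 1 if rng.random() < p else 0  # sample a routing action
        r = reward(simulate_expert(expert, difficulty, rng), expert)
        weights = reinforce_step(weights, features, expert, r, baseline)
        baseline += 0.01 * (r - baseline)  # running-average baseline
    return weights
```

In this sketch the cost penalty is what pushes the router toward the efficient expert whenever the large expert's accuracy gain does not justify its cost. A real system would replace the stub experts with actual models and the scalar feature with a learned representation of the video.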
