CV · Jul 18, 2023

Human Action Recognition in Still Images Using ConViT

Seyed Rohollah Hosseyni, Sanaz Seyedin, Hasan Taheri

arXiv:2307.08994v35.05 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of recognizing human actions in static images for computer vision applications. It represents an incremental improvement over existing methods.

The paper tackles human action recognition in still images by proposing a new module that combines a Vision Transformer (ViT) with convolutional neural networks (CNNs) to extract relationships between image parts. It reports 95.5% mAP on the Stanford40 dataset and 91.5% mAP on the PASCAL VOC 2012 action dataset.
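For readers unfamiliar with the metric, the following is a minimal sketch of mean Average Precision. It assumes the plain, non-interpolated definition of average precision; the paper's exact evaluation protocol is not stated here and may differ (PASCAL VOC evaluations often use an interpolated variant). The function names and the toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: classifier confidence per image; labels: 1 if the image
    belongs to the class, else 0. AP is the mean of the precision values
    measured at each rank where a positive image appears.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """mAP: the unweighted mean of the per-class AP values."""
    aps = [
        average_precision(scores_by_class[name], labels_by_class[name])
        for name in scores_by_class
    ]
    return sum(aps) / len(aps)


# Toy data (hypothetical action classes and scores, not from the paper).
toy_scores = {"running": [0.9, 0.8, 0.7, 0.6], "reading": [0.9, 0.1]}
toy_labels = {"running": [1, 0, 1, 0], "reading": [0, 1]}
toy_map = mean_average_precision(toy_scores, toy_labels)
```

A reported figure such as 95.5% mAP is this kind of per-class average taken over all action classes in the dataset.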

Understanding the relationship between different parts of an image is crucial in a variety of applications, including object recognition, scene understanding, and image classification. Although Convolutional Neural Networks (CNNs) have demonstrated impressive results in classifying and detecting objects, they lack the capability to extract the relationship between different parts of an image, which is a crucial factor in Human Action Recognition (HAR). To address this problem, this paper proposes a new module that uses a Vision Transformer (ViT) and functions like a convolutional layer. In the proposed model, the Vision Transformer can complement a convolutional neural network in a variety of tasks by helping it extract the relationships among the parts of an image. The paper shows that the proposed model, compared to a simple CNN, can extract the meaningful parts of an image and suppress the misleading ones. Evaluated on the Stanford40 and PASCAL VOC 2012 action datasets, the model achieves 95.5% and 91.5% mean Average Precision (mAP), respectively, which is promising compared to other state-of-the-art methods.
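To illustrate how attention can relate distant image parts in a way a local convolution kernel cannot, here is a rough sketch of self-attention applied to a CNN-style feature map. It is not the authors' module: it assumes identity query/key/value projections, a single head, no positional embeddings, and no learned weights, and every function name is hypothetical. Like a convolutional layer, it takes an H x W x C feature map and returns one of the same shape.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    tokens: list of N feature vectors of length D, one per image location.
    Each output vector is a weighted mix of ALL input vectors, so every
    location can draw on every other location in a single step.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def attention_layer(feature_map):
    """Apply self-attention to an H x W x C feature map (nested lists).

    The map is flattened to H*W tokens, mixed by attention, added back to
    the input (residual connection), and reshaped to H x W x C.
    """
    height, width = len(feature_map), len(feature_map[0])
    tokens = [cell for row in feature_map for cell in row]
    mixed = self_attention(tokens)
    combined = [[a + b for a, b in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]
    return [combined[r * width:(r + 1) * width] for r in range(height)]


# Toy 2 x 2 feature map with 2 channels (hypothetical values).
toy_map_in = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
toy_map_out = attention_layer(toy_map_in)
```

Because the output shape matches the input shape, a block like this can in principle be placed between convolutional layers, which is the sense in which such a module "functions like a convolutional layer".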
