CV · AI · LG · Jan 17, 2021

Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture Recognition

arXiv:2101.06634v1 · 17 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of robustly recognizing subtle human gestures and actions in images, which is crucial for applications like human-computer interaction and behavioral analysis, though it appears incremental by building on existing attention and CNN methods.

The paper tackles fine-grained gesture and action recognition in monocular images by proposing a Regional Attention Network (RAN) that uses attention mechanisms to focus on discriminative semantic regions, achieving state-of-the-art results across ten datasets in head pose, driver state, and human action recognition.

Affect is often expressed via non-verbal body language such as actions and gestures, which are vital indicators of human behavior. Recent studies on recognizing fine-grained actions and gestures in monocular images have mainly focused on modeling the spatial configuration of body parts representing body pose, human-object interactions, and variations in local appearance. The results show that this is a brittle approach, since it relies on accurate detection of body parts and objects. In this work, we argue that there exist local discriminative semantic regions whose "informativeness" can be evaluated by an attention mechanism for inferring fine-grained gestures and actions. To this end, we propose a novel end-to-end Regional Attention Network (RAN), a fully convolutional neural network (CNN) that combines multiple contextual regions through an attention mechanism, focusing on the parts of the image that are most relevant to a given task. Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing the HOG (Histogram of Oriented Gradients) descriptor. The model is extensively evaluated on ten datasets belonging to three scenarios: 1) head pose recognition, 2) driver state recognition, and 3) human action and facial expression recognition. The proposed approach outperforms the state of the art by a considerable margin across different metrics.
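
The abstract describes regions built from consecutive cells and combined through attention, but gives no implementation details. The following is only a minimal NumPy sketch of that general idea, not the authors' architecture. It assumes a square grid of cells over a feature map, treats every rectangle of adjacent cells as a region, mean-pools each region, and weights the regions with a softmax over dot-product scores. The names `cell_regions`, `attend_regions`, and `query` are hypothetical and introduced here for illustration.

```python
import numpy as np

def cell_regions(grid):
    """List every axis-aligned rectangle of consecutive cells on a grid x grid layout.

    Each region is (row_start, row_end, col_start, col_end) in cell units,
    with exclusive end indices. This enumeration is an assumption of the sketch.
    """
    regions = []
    for r0 in range(grid):
        for r1 in range(r0 + 1, grid + 1):
            for c0 in range(grid):
                for c1 in range(c0 + 1, grid + 1):
                    regions.append((r0, r1, c0, c1))
    return regions

def attend_regions(feature_map, grid, query):
    """Pool each region of an (H, W, C) feature map and combine them with attention.

    Assumes H and W are divisible by `grid`. `query` is a (C,) vector standing
    in for learned attention parameters.
    """
    height, width, _ = feature_map.shape
    cell_h, cell_w = height // grid, width // grid

    # Mean-pool the features inside each region -> (num_regions, C).
    pooled = np.stack([
        feature_map[r0 * cell_h:r1 * cell_h, c0 * cell_w:c1 * cell_w].mean(axis=(0, 1))
        for r0, r1, c0, c1 in cell_regions(grid)
    ])

    # Softmax over per-region scores (shifted by the max for numerical stability).
    scores = pooled @ query
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()

    # Attention-weighted sum of region descriptors -> (C,).
    return weights @ pooled, weights
```

With a 2 x 2 grid this enumeration yields nine regions (four single cells, four two-cell strips, and the full map), and the returned weights indicate how much each region contributes to the combined descriptor. In a trained network the scoring function would be learned end to end rather than supplied as a fixed vector.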

