Contextual Attention for Hand Detection in the Wild
This work addresses hand detection for applications in human-computer interaction and robotics. Its contribution is incremental: it builds on MaskRCNN and adds a new attention module.
The paper tackles hand detection in unconstrained images by proposing Hand-CNN, an architecture that extends MaskRCNN with a contextual attention mechanism. It reports better results than existing methods on several datasets, including the PASCAL VOC human layout challenge and a newly introduced hand detection benchmark.
We present Hand-CNN, a novel convolutional network architecture for detecting hand masks and predicting hand orientations in unconstrained images. Hand-CNN extends MaskRCNN with a novel attention mechanism to incorporate contextual cues in the detection process. This attention mechanism can be implemented as an efficient network module that captures non-local dependencies between features. This network module can be inserted at different stages of an object detection network, and the entire detector can be trained end-to-end. We also introduce a large-scale annotated hand dataset containing hands in unconstrained images for training and evaluation. We show that Hand-CNN outperforms existing methods on several datasets, including our hand detection benchmark and the publicly available PASCAL VOC human layout challenge. We also conduct ablation studies on hand detection to show the effectiveness of the proposed contextual attention module.
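To make the idea of a module that "captures non-local dependencies between features" concrete, the following is a minimal sketch of a generic non-local (self-attention) block over a feature map, written in NumPy. It is not the Hand-CNN attention module itself: the abstract does not specify its form, so the dot-product similarity, the residual connection, the function name `non_local_attention`, and all weight matrices and shapes here are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def non_local_attention(features, w_query, w_key, w_value, w_out):
    """Generic non-local block (hypothetical helper, not the paper's module).

    features: array of shape (H, W, C), a convolutional feature map.
    w_query, w_key, w_value: projection matrices of shape (C, D).
    w_out: projection matrix of shape (D, C) mapping back to C channels.
    Returns the refined feature map (H, W, C) and the attention matrix (H*W, H*W).
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)            # flatten spatial positions

    q = x @ w_query                           # (N, D) queries
    k = x @ w_key                             # (N, D) keys
    v = x @ w_value                           # (N, D) values

    # Every position is compared with every other position, which is what
    # lets the block use context far outside a local receptive field.
    scores = (q @ k.T) / np.sqrt(q.shape[1])  # (N, N)
    attention = softmax(scores, axis=1)       # each row sums to 1

    context = attention @ v                   # (N, D) aggregated context
    out = x + context @ w_out                 # residual keeps the input shape
    return out.reshape(h, w, c), attention


# Toy example with made-up sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
H, W, C, D = 4, 5, 8, 4
features = rng.standard_normal((H, W, C))
w_query = rng.standard_normal((C, D)) * 0.1
w_key = rng.standard_normal((C, D)) * 0.1
w_value = rng.standard_normal((C, D)) * 0.1
w_out = rng.standard_normal((D, C)) * 0.1

output, attention = non_local_attention(features, w_query, w_key, w_value, w_out)
```

Because the output has the same shape as the input, a block of this kind can in principle be placed between existing stages of a detection network without changing the layers around it, which is consistent with the abstract's statement that the module can be inserted at different stages and trained end-to-end.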