CV · Apr 26, 2024

Two in One Go: Single-stage Emotion Recognition with Decoupled Subject-context Transformer

Xinpeng Li, Teng Wang, Jian Zhao, Shuyi Mao, Jinbao Wang, Feng Zheng, Xiaojiang Peng, Xuelong Li

arXiv:2404.17205v2 · 6.57 citations · h-index: 11 · MM

Originality: Incremental advance

AI Analysis

This work addresses inefficiency and limited subject-context interaction in image-based emotion recognition, and represents an incremental improvement over existing two-stage methods.

The paper tackles the disjoint training stages and limited subject-context interaction of two-stage emotion recognition by proposing a single-stage approach built on a Decoupled Subject-Context Transformer (DSCT), reporting a 3.39% accuracy improvement on CAER-S and a 6.46% average precision gain on EMOTIC.

Emotion recognition aims to discern the emotional state of subjects within an image, relying on subject-centric and contextual visual cues. Current approaches typically follow a two-stage pipeline: first localize subjects with off-the-shelf detectors, then perform emotion classification through late fusion of subject and context features. However, this complicated paradigm suffers from disjoint training stages and limited interaction between fine-grained subject-context elements. To address these challenges, we present a single-stage emotion recognition approach, employing a Decoupled Subject-Context Transformer (DSCT), for simultaneous subject localization and emotion classification. Rather than compartmentalizing training stages, we jointly leverage box and emotion signals as supervision to enrich subject-centric feature learning. Furthermore, we introduce DSCT to facilitate interactions between fine-grained subject-context cues in a decouple-then-fuse manner. The decoupled query tokens (subject queries and context queries) gradually intertwine across layers within DSCT, during which spatial and semantic relations are exploited and aggregated. We evaluate our single-stage framework on two widely used context-aware emotion recognition datasets, CAER-S and EMOTIC. Our approach surpasses two-stage alternatives with fewer parameters, achieving a 3.39% accuracy improvement on CAER-S and a 6.46% average precision gain on EMOTIC.
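The decouple-then-fuse idea in the abstract can be illustrated with a toy sketch. This is not the authors' DSCT implementation: the abstract does not specify layer internals, so everything below is an assumption made for illustration, including the function names (`attend`, `decouple_then_fuse_layer`), the use of plain scaled dot-product attention without learned projections, and the residual additions. It shows only the general pattern of two query sets that first read image features separately and then exchange information.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention on plain lists (no learned weights)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(a, b):
    """Element-wise residual addition of two lists of vectors."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def decouple_then_fuse_layer(subject_q, context_q, image_tokens):
    """Hypothetical single layer: decouple first, then fuse."""
    # Decouple: each query set attends to the image tokens independently.
    subject_q = add(subject_q, attend(subject_q, image_tokens, image_tokens))
    context_q = add(context_q, attend(context_q, image_tokens, image_tokens))
    # Fuse: subject queries gather information from the context queries.
    subject_q = add(subject_q, attend(subject_q, context_q, context_q))
    return subject_q, context_q

# Toy inputs: 3 image tokens, 1 subject query, 2 context queries, dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subject_q = [[0.5, 0.5]]
context_q = [[0.1, 0.2], [0.3, 0.4]]
for _ in range(2):  # stacking layers lets the two query sets intertwine gradually
    subject_q, context_q = decouple_then_fuse_layer(subject_q, context_q, image_tokens)
```

In a single-stage model of the kind the abstract describes, the final subject queries would presumably feed both a box head and an emotion head so that the two supervision signals train the same features; those heads are omitted here because the abstract gives no detail on them.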
