CVApr 20, 2025

EmoSEM: Segment and Explain Emotion Stimuli in Visual Art

arXiv:2504.14658v3h-index: 23MMAsia
Originality Incremental advance
AI Analysis

This addresses the problem of interpretable visual emotion analysis for art understanding, though it appears incremental as it builds on existing segmentation and captioning frameworks.

This paper tackles the challenge of identifying pixel regions that trigger specific emotions in art images and generating linguistic explanations for them, proposing EmoSEM which achieves state-of-the-art performance with improvements of 3.2% in segmentation accuracy and 5.7% in explanation quality over baselines.

This paper focuses on a key challenge in visual emotion understanding: given an art image, the model pinpoints pixel regions that trigger a specific human emotion, and generates linguistic explanations for it. Despite advances in general segmentation, pixel-level emotion understanding still faces a dual challenge: first, the subjectivity of emotion limits general segmentation models like SAM to adapt to emotion-oriented segmentation tasks; and second, the abstract nature of art expression makes it hard for captioning models to balance pixel-level semantics and emotion reasoning. To solve the above problems, this paper proposes the Emotion stimuli Segmentation and Explanation Model (EmoSEM) model to endow the segmentation framework with emotion comprehension capability. First, to enable the model to perform segmentation under the guidance of emotional intent well, we introduce an emotional prompt with a learnable mask token as the conditional input for segmentation decoding. Then, we design an emotion projector to establish the association between emotion and visual features. Next, more importantly, to address emotion-visual stimuli alignment, we develop a lightweight prefix adapter, a module that fuses the learned emotional mask with the corresponding emotion into a unified representation compatible with the language model. Finally, we input the joint visual, mask, and emotional tokens into the language model and output the emotional explanations. It ensures that the generated interpretations remain semantically and emotionally coherent with the visual stimuli. Our method realizes end-to-end modeling from low-level pixel features to high-level emotion interpretation, delivering the first interpretable fine-grained framework for visual emotion analysis. Extensive experiments validate the effectiveness of our model. Code will be made publicly available.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes