AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
This work addresses how to evaluate and improve the affective capabilities of VLMs. Its contribution is incremental: it builds on existing VLM frameworks, adding a new benchmark and a prompting method.
The paper tackles the underexplored problem of holistic Affective Image Content Analysis (AICA) in Vision-Language Models (VLMs) by introducing AICA-Bench, a benchmark with three tasks. It finds that VLMs suffer from weak intensity calibration and shallow open-ended descriptions, and it proposes Grounded Affective Tree (GAT) Prompting to reduce intensity errors and improve descriptive depth.
Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
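To make "intensity error" concrete, the sketch below computes a mean absolute error between predicted and reference emotion intensities. This is an illustration only: the abstract does not state which error metric AICA-Bench uses, and the function name, the [0, 1] intensity scale, and the sample values are all assumptions made for the example.

```python
from typing import Sequence


def mean_absolute_intensity_error(
    predicted: Sequence[float], reference: Sequence[float]
) -> float:
    """Average absolute gap between predicted and reference intensities.

    Hypothetical helper: the metric actually used by the benchmark is
    not specified in the abstract.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one sample is required")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Made-up intensities on an assumed [0, 1] scale, one value per image.
model_scores = [0.9, 0.2, 0.7]
human_scores = [0.5, 0.4, 0.6]
error = mean_absolute_intensity_error(model_scores, human_scores)
print(f"mean absolute intensity error: {error:.3f}")
```

Under this assumed metric, a lower value means the model's intensity estimates sit closer to the reference annotations, which is the direction of improvement the abstract attributes to GAT.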