CV | Nov 4, 2025

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

arXiv:2511.02712v1 | 6 citations | h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses emotion analysis in videos for applications in affective computing and human-computer interaction, representing an incremental advancement by building on existing video large language models with a specialized framework and dataset.

The paper tackles the challenge of understanding complex and evolving emotions in videos by proposing VidEmo, a novel affective cues-guided reasoning framework that unifies attribute perception, expression analysis, and emotional understanding, achieving competitive performance and setting a new milestone across 15 face perception tasks.

Understanding and predicting emotion from videos has attracted significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cue-dependent properties, making it difficult to understand complex and evolving emotional states with a reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

