CV | Apr 17

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi, Kaiwei Wang, Lin Wang

arXiv:2604.15823 | 64.2 | h-index: 25

AI Analysis

For researchers in embodied robotics and affective computing, this work addresses the domain shift that arises when movie emotion understanding moves from cinematic footage to egocentric screen viewing, a practical but overlooked problem.

The paper introduces EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding, and proposes a multimodal long-context emotion reasoning framework. Cross-domain experiments show a severe domain gap (Macro-F1 drops from 27.99 to 16.69) when models trained on cinematic footage are evaluated on egocentric views, and training on ESE improves robustness.

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
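To make the reported numbers concrete, the following is a minimal sketch of how a multi-label Macro-F1 and the relative cross-domain drop could be computed. It is not the authors' evaluation code: the abstract does not describe the thresholding or the handling of confidence-aware labels, so the sketch assumes binary indicator labels, assigns F1 = 0 to a label with no true or predicted positives, and uses a hypothetical three-label toy example. The function names `macro_f1` and `relative_drop` are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over rows of 0/1 label indicators.

    Assumption: a label with no true and no predicted positives scores 0.
    """
    n_labels = len(y_true[0])
    scores = []
    for k in range(n_labels):
        pairs = [(t[k], p[k]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def relative_drop(source_score: float, target_score: float) -> float:
    """Fraction of the source-domain score lost on the target domain."""
    return (source_score - target_score) / source_score


# Hypothetical toy data: 4 key-frames, 3 emotion labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0]]
toy_score = macro_f1(y_true, y_pred)

# Macro-F1 values reported in the abstract (cinematic vs. egocentric evaluation).
drop = relative_drop(27.99, 16.69)
print(f"toy Macro-F1: {toy_score:.4f}, relative drop: {drop:.1%}")
```

Under these assumptions, the reported fall from 27.99 to 16.69 corresponds to a loss of roughly 40% of the cinematic-domain Macro-F1. Because Macro-F1 averages labels without weighting by frequency, rare emotion categories contribute as much to this figure as common ones.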
