Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
This work highlights critical limitations of current mLLMs for real-world political emotion analysis, providing a benchmark for future improvements.
The study evaluated multimodal large language models (mLLMs) for measuring emotional arousal in political videos. The models performed near human level on videos created under laboratory conditions, but in real-world parliamentary debates their scores correlated at best moderately with human ratings and showed systematic bias by speaker gender and age.
Research increasingly leverages audio-visual materials to analyze emotions in political communication. Multimodal large language models (mLLMs) promise to enable such analyses through in-context learning. However, we lack systematic evidence on whether these models can reliably measure emotions in real-world political settings. This paper evaluates leading mLLMs for video-based emotional arousal measurement using two complementary human-labeled video datasets: recordings created under laboratory conditions and recordings of real-world parliamentary debates. I find a critical lab-vs-field performance gap. In videos created under laboratory conditions, mLLMs' arousal scores approach human-level reliability with little to no demographic bias. However, in parliamentary debate recordings, all examined models' arousal scores correlate at best moderately with average human ratings and exhibit systematic bias by speaker gender and age. Neither relying on leading closed-source mLLMs nor applying computational noise mitigation strategies changes this finding. Further, mLLMs underperform even in sentiment analysis when given video recordings instead of text transcripts of the same speeches. These findings reveal important limitations of current mLLMs for real-world political video analysis and establish a rigorous evaluation framework for tracking future developments.
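The two checks described above, agreement with average human ratings and bias by speaker demographics, can be illustrated with a minimal sketch. It is not the paper's evaluation code: the choice of Pearson correlation and of mean signed error as the bias measure, the record layout, the function names, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_error_by_group(records, group_key):
    """Mean signed error (model score minus mean human rating) per group."""
    errors = defaultdict(list)
    for r in records:
        errors[r[group_key]].append(r["model"] - mean(r["human"]))
    return {group: mean(errs) for group, errs in errors.items()}


# Toy, made-up records for illustration only; not data from the study.
# "model" is a hypothetical mLLM arousal score, "human" the coder ratings.
records = [
    {"model": 3.0, "human": [2.0, 3.0, 4.0], "gender": "f"},
    {"model": 5.0, "human": [4.0, 4.0, 4.0], "gender": "f"},
    {"model": 2.0, "human": [2.0, 2.0, 2.0], "gender": "m"},
    {"model": 4.0, "human": [5.0, 5.0, 5.0], "gender": "m"},
]

human_means = [mean(r["human"]) for r in records]
model_scores = [r["model"] for r in records]

r_value = pearson(model_scores, human_means)   # agreement with average human rating
bias = mean_error_by_group(records, "gender")  # signed over- or under-scoring per group

print(f"correlation with human mean: {r_value:.2f}")
print(f"mean signed error by gender: {bias}")
```

A group-wise mean signed error that differs clearly from zero in opposite directions across groups would be one simple indication of the kind of systematic bias reported for the parliamentary debate recordings; a regression with speaker covariates would be a natural alternative.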