AIMay 19, 2025

Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLMs

arXiv:2505.12746v23 citationsh-index: 16Sci Rep
Originality Incremental advance
AI Analysis

This addresses the challenge of accurately modeling complex human emotions for applications in AI-human interaction, though it is incremental in showing current MLLMs' partial capabilities.

The study examined whether multimodal large language models (MLLMs) can capture the high-dimensional structure of human emotions elicited by video clips, finding strong similarity at the category level but limitations at the single-item level.

Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. A full capturing of this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) capture these high-dimensional, intricate emotion structures, including capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., Gemini or GPT). We evaluated performance not only at the individual video level but also from emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models is at the signle item level or the coarse-categorical level, we applied Gromov Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the model could infer human emotional experiences at the category level. Our results suggest that current state-of-the-art MLLMs broadly capture the complex high-dimensional emotion structures at the category level, as well as their apparent limitations in accurately capturing entire structures at the single-item level.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes