CL Feb 24

MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents

Zhenyu Wang, Xiaofen Xing, Yirong Chen, Xiangmin Xu

arXiv:2602.21941v10.6

Originality: Incremental advance

AI Analysis

This work addresses the evaluation bottleneck for multimodal role-playing agents. It is incremental because it refines existing assessment methods rather than introducing a new paradigm.

The paper tackles the evaluation of multimodal role-playing agents by proposing MERRY, a framework that decouples semantic assessment from modality generation. It introduces eight refined metrics and recasts subjective scoring as a bidirectional-evidence-finding task. The results show that training on synthetic datasets reduces emotional consistency while training on real-world datasets improves it, and that existing models exhibit a positive bias and performance bottlenecks on negative emotions.

Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm entangles semantic assessment with modality generation, which leads to ambiguous error attribution, and it remains constrained by a heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing the Multimodal Emotional and Role consistencies of Role-playing agents. The framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) a simple prompting method strengthens weak models but constrains strong ones, while a simple fine-tuning method suffers from poor role generalization. Code and dataset are available.
