SocialBench: Sociality Evaluation of Role-Playing Conversational Agents
SocialBench provides a testbed for evaluating social interactions in AI agents, addressing a gap in prior research, but the contribution is incremental because it focuses on benchmarking rather than on new methods.
The authors address the lack of assessment of social intelligence in role-playing conversational agents by introducing SocialBench, a benchmark that evaluates sociality at both the individual and group levels. They find that agents that excel at the individual level may not perform well in groups, and that an individual agent's behavior can drift under the influence of other agents in the group.
Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce SocialBench, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions. The benchmark is constructed from a variety of sources and covers 500 characters, over 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that excelling at the individual level does not imply proficiency at the group level. Moreover, the behavior of individuals may drift as a result of the influence exerted by other agents within the group. Experimental results on SocialBench confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/SocialBench.