SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
This paper addresses the problem of assessing social intelligence in LLMs for developers and researchers. It offers a novel benchmark, but its improvements to evaluation methods are incremental.
The paper tackles the challenge of evaluating large language models (LLMs) in realistic social interactions by introducing SI-Bench, a benchmark built on 2,221 authentic human dialogues. It finds that SOTA models surpass the human expert in process reasoning but lag behind humans in reply quality, and that Chain-of-Thought (CoT) reasoning may degrade performance.
As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at https://github.com/SI-Bench/SI-Bench.git.