CLAISIApr 4, 2025

How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks

arXiv:2505.04628v12 citationsh-index: 3
Originality Incremental advance
AI Analysis

This addresses the problem of evaluating LLMs for complex social interactions, though it is incremental as it builds on existing benchmark methodologies.

The authors tackled the lack of systematic measurement for LLMs' capabilities in multi-user, multi-turn social agent tasks by introducing a benchmark called HSII, which evaluates communication and task completion in realistic social scenarios, showing it effectively assesses social skills.

Expanding the application of large language models (LLMs) to societal life, instead of primary function only as auxiliary assistants to communicate with only one person at a time, necessitates LLMs' capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM's social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs' social performance. Since COT cost more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes