AI | CL | LG | Oct 18, 2023

SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

AI2, CMU
arXiv:2310.11667v2 | 329 citations | h-index: 91
Originality: Incremental advance
AI Analysis

This paper addresses the need for better evaluation of social intelligence in AI agents, which is crucial for developing more human-like AI. It is rated as an incremental advance because it builds on existing role-play and evaluation frameworks.

The authors tackle the problem of evaluating social intelligence in AI systems by introducing SOTOPIA, an open-ended environment for simulating complex social interactions. They find that GPT-4 achieves a significantly lower goal completion rate than humans on SOTOPIA-hard, the subset of scenarios that is challenging for all models.

Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.

Code Implementations: 2 repos