SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
For researchers and developers of LLM-based mediation systems, SoCRATES offers a more realistic and reliable evaluation method, and its results show that current LLMs struggle to adapt socially to diverse conditions.
SoCRATES introduces a benchmark for evaluating proactive LLM mediators across eight domains and five socio-cognitive axes, using a topic-localized evaluator that achieves 0.82 alignment with human experts. Benchmarking eight frontier LLMs shows the best mediator closes only about a third of the unmediated consensus gap, with performance varying sharply by socio-cognitive axis.
Evaluating LLM mediators remains challenging because mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, which introduces off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and uses a topic-localized evaluator that scores each topic only on the turns that advance it. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap in diverse, realistic testbeds, and that performance varies sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.
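To make the contrast between per-turn and topic-localized scoring concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SoCRATES implementation. The function names, the score scale in [0, 1], the `relevant_turns` mapping, and the "gap closed" formula (improvement over the unmediated score divided by the remaining distance to a ceiling) are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def per_turn_score(turn_scores, topics):
    """Baseline (assumed form): average every turn's score for every topic."""
    return {t: mean(turn[t] for turn in turn_scores) for t in topics}


def topic_localized_score(turn_scores, relevant_turns, topics):
    """Average each topic only over the turns tagged as advancing it.

    relevant_turns is a hypothetical mapping: topic -> list of turn indices.
    Topics with no tagged turns get None rather than a misleading zero.
    """
    scores = {}
    for topic in topics:
        indices = relevant_turns.get(topic, [])
        if indices:
            scores[topic] = mean(turn_scores[i][topic] for i in indices)
        else:
            scores[topic] = None
    return scores


def gap_closed(unmediated, mediated, ceiling=1.0):
    """Fraction of the unmediated consensus gap closed (assumed definition)."""
    gap = ceiling - unmediated
    if gap <= 0:
        return 0.0  # nothing left to close
    return (mediated - unmediated) / gap


# Toy dialogue: three turns, two topics, made-up scores in [0, 1].
topics = ["rent", "repairs"]
turn_scores = [
    {"rent": 0.8, "repairs": 0.0},  # turn 0 discusses rent only
    {"rent": 0.0, "repairs": 0.6},  # turn 1 discusses repairs only
    {"rent": 0.6, "repairs": 0.0},  # turn 2 returns to rent
]
relevant_turns = {"rent": [0, 2], "repairs": [1]}

baseline = per_turn_score(turn_scores, topics)
localized = topic_localized_score(turn_scores, relevant_turns, topics)
print("per-turn:", baseline)
print("topic-localized:", localized)
print("gap closed:", gap_closed(unmediated=0.4, mediated=0.6))
```

In this toy data the per-turn baseline pulls each topic's score down with zeros from turns that never addressed it, which is the kind of off-topic noise the abstract describes. The localized variant averages only over the tagged turns.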