CL · AI · CY · Sep 30, 2025

RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

arXiv:2509.25897v1 · 3 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the need to assess how LLMs behave in complex, ambiguous social situations as they are increasingly applied to human decision-making. It is incremental in that it builds on existing evaluations of social ability, adding a focus on contextual sensitivity.

The authors evaluate the contextual sensitivity of large language models (LLMs) in ambiguous social dilemmas by introducing RoleConflictBench, a benchmark of over 13K role conflict scenarios. They find that this sensitivity is insufficient: model decisions are heavily biased toward specific roles, with a dominant preference for the Family and Occupation domains, male roles, and Abrahamic religions.

Humans often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) become increasingly influential in human decision-making, understanding how they behave in complex social situations is essential. While previous research has evaluated LLMs' social abilities in contexts with predefined correct answers, role conflicts represent inherently ambiguous social dilemmas that require contextual sensitivity: the ability to recognize and appropriately weigh situational cues that can fundamentally alter decision priorities. To address this gap, we introduce RoleConflictBench, a novel benchmark designed to evaluate LLMs' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles, systematically varying their associated expectations (i.e., their responsibilities and obligations) and situational urgency levels. By analyzing model choices across 10 different LLMs, we find that while LLMs show some capacity to respond to these contextual cues, this sensitivity is insufficient. Instead, their decisions are predominantly governed by a powerful, inherent bias related to social roles rather than situational information. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains, as well as a clear prioritization of male roles and Abrahamic religions across most evaluatee models.
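As a rough illustration of the kind of analysis the abstract describes, the sketch below computes two quantities from a list of pairwise role-conflict trials: how often each role is chosen when it appears, and how often the model picks the role with the higher urgency. This is not the authors' code or metric definitions. The `Trial` record, the integer urgency scale, the role names, and both function names are assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical scenario: two conflicting roles and the model's pick."""
    role_a: str
    role_b: str
    urgency_a: int  # assumed scale: higher means more urgent
    urgency_b: int
    chosen: str     # must equal role_a or role_b


def role_win_rates(trials):
    """Fraction of appearances in which each role was chosen."""
    appearances = Counter()
    wins = Counter()
    for t in trials:
        appearances[t.role_a] += 1
        appearances[t.role_b] += 1
        wins[t.chosen] += 1
    return {role: wins[role] / n for role, n in appearances.items()}


def urgency_follow_rate(trials):
    """Share of unequal-urgency trials where the more urgent role was chosen."""
    decisive = [t for t in trials if t.urgency_a != t.urgency_b]
    if not decisive:
        return None
    followed = sum(
        1
        for t in decisive
        if t.chosen == (t.role_a if t.urgency_a > t.urgency_b else t.role_b)
    )
    return followed / len(decisive)


# Toy data, invented for illustration only.
trials = [
    Trial("parent", "employee", 3, 1, "parent"),
    Trial("parent", "employee", 1, 3, "parent"),
    Trial("employee", "volunteer", 2, 2, "employee"),
    Trial("volunteer", "parent", 3, 1, "parent"),
]

rates = role_win_rates(trials)
follow = urgency_follow_rate(trials)
print(rates)
print(follow)
```

In this toy data the "parent" role is chosen every time it appears, while the more urgent role is chosen in only one of three unequal-urgency trials. That is the pattern the abstract reports in aggregate: role identity dominating situational cues.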

