In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

arXiv:2605.12530


AI Analysis

For researchers and practitioners evaluating LLM fairness, this work highlights the unreliability of current benchmarks and offers a more robust evaluation method.

The paper argues that standardized-test benchmarks are unreliable for evaluating LLM fairness because their scores are structurally sensitive to prompt variations, and it proposes MAC-Fairness, a multi-agent conversational framework for in-situ behavioral evaluation. Analyzing 8 million conversation transcripts, the authors find stable, model-specific behavioral signatures that may generalize across benchmarks.

LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both direction and magnitude, and produce severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavioral evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how models hold their positions, from the self-perspective) and peer receptiveness (how receptive models are to their peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.
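To make the two behavioral quantities named in the abstract more concrete, here is a minimal Python sketch. The abstract does not give formal definitions, so everything below is an assumption for illustration only: the transcript format (a mapping from agent name to that agent's answer in each round), the function names, and the specific ratios used are hypothetical and are not the paper's metrics.

```python
from typing import Dict, List

# Hypothetical transcript format (an assumption, not from the paper):
# for each agent, the answer it gave in each round of one conversation.
# All agents are assumed to have the same number of rounds.
Transcript = Dict[str, List[str]]


def position_persistence(transcript: Transcript, agent: str) -> float:
    """Fraction of round-to-round transitions in which `agent` keeps its answer."""
    answers = transcript[agent]
    transitions = list(zip(answers, answers[1:]))
    if not transitions:
        return 1.0  # a single round offers no opportunity to change
    kept = sum(prev == curr for prev, curr in transitions)
    return kept / len(transitions)


def peer_receptiveness(transcript: Transcript, agent: str) -> float:
    """Among rounds that follow a peer disagreement, the fraction in which
    `agent` switches to an answer a disagreeing peer held in the prior round."""
    answers = transcript[agent]
    opportunities = 0
    adopted = 0
    for t in range(1, len(answers)):
        own_prev = answers[t - 1]
        peer_prev = {a[t - 1] for name, a in transcript.items() if name != agent}
        differing = peer_prev - {own_prev}
        if not differing:
            continue  # no peer disagreed, so nothing to be receptive to
        opportunities += 1
        if answers[t] in differing:
            adopted += 1
    return adopted / opportunities if opportunities else 0.0


# Toy two-agent conversation, invented purely for illustration.
toy = {"A": ["x", "x", "y"], "B": ["y", "y", "y"]}
print(position_persistence(toy, "A"), peer_receptiveness(toy, "A"))
```

Under these assumed definitions, one could compute both quantities for the same model with and without identity cues present and compare the differences, which is the kind of contrast the abstract describes.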
