CL, AI, CY, HC | Oct 1, 2025

Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

Peking U, Tencent
arXiv:2510.01164v1 | 3 citations | h-index: 23
Originality: Incremental advance
AI Analysis

This paper addresses the risk of deploying LLMs for high-stakes societal decisions and highlights the need for specialized benchmarks, but it is incremental because it builds on existing evaluation frameworks.

The paper evaluates how large language models (LLMs) allocate societal resources. It introduces the Social Welfare Function Benchmark, which tests LLMs on trade-offs between efficiency and fairness, and finds that most models prioritize productivity over equality and that their allocation strategies are easily perturbed by output-length constraints and social-influence framing.

Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
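The two metrics named in the abstract can be illustrated with a minimal Python sketch. The Gini function below uses the standard textbook definition (mean absolute difference over twice the mean). The `roi` helper is a hypothetical simplification (total output divided by total cost); the abstract does not give the benchmark's exact formulas, so both functions and their argument names are assumptions for illustration only.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Textbook Gini coefficient for non-negative values.

    0.0 means perfect equality; values approach 1.0 as one
    recipient holds nearly everything.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # convention: no resources means no measurable inequality
    # Sum of absolute differences over all ordered pairs (i, j).
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    # G = sum|x_i - x_j| / (2 * n^2 * mean) = sum|x_i - x_j| / (2 * n * total)
    return abs_diff_sum / (2 * n * total)


def roi(outputs: Sequence[float], costs: Sequence[float]) -> float:
    """Hypothetical ROI: total output per unit of total cost (an assumption)."""
    total_cost = sum(costs)
    if total_cost == 0:
        raise ValueError("total cost must be positive")
    return sum(outputs) / total_cost


if __name__ == "__main__":
    equal_split = [2.5, 2.5, 2.5, 2.5]
    winner_takes_all = [10.0, 0.0, 0.0, 0.0]
    print(gini(equal_split))       # perfectly equal allocation
    print(gini(winner_takes_all))  # highly unequal allocation
```

Under these assumed definitions, an allocator that routes every task to the most productive recipient tends to raise ROI while pushing the Gini coefficient upward, which is the trade-off the benchmark is described as creating.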

