CY AI Jan 28

Agent Benchmarks Fail Public Sector Requirements

arXiv:2601.20617v1
1 citation
h-index: 5
Originality: Synthesis-oriented
AI Analysis

The paper highlights a critical gap for public sector practitioners and researchers: existing benchmarks do not show whether agents meet legal and procedural requirements.

The paper tackles the problem of evaluating LLM-based agents for public sector deployment by defining criteria that benchmarks must meet, and finds that none of the more than 1,300 benchmark papers analyzed satisfies all of them.

Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action both for researchers to develop public-sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.
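As an illustration of how an official might apply the four criteria as a checklist, here is a minimal sketch in Python. It is not the paper's pipeline. The class name `BenchmarkAssessment`, its field names, and the example benchmark are assumptions made for this sketch, and the true/false judgments would have to come from a human or expert-validated review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkAssessment:
    """Hypothetical record of one benchmark judged against the four criteria."""

    name: str
    process_based: bool
    realistic: bool
    public_sector_specific: bool
    public_sector_metrics: bool

    def unmet_criteria(self) -> list[str]:
        """Return the labels of the criteria this benchmark fails."""
        checks = {
            "process-based": self.process_based,
            "realistic": self.realistic,
            "public-sector-specific": self.public_sector_specific,
            "public-sector metrics": self.public_sector_metrics,
        }
        return [label for label, satisfied in checks.items() if not satisfied]

    def meets_all(self) -> bool:
        """A benchmark qualifies only if no criterion is unmet."""
        return not self.unmet_criteria()


# Made-up example: a realistic, process-based benchmark that is not
# specific to the public sector and reports only generic metrics.
example = BenchmarkAssessment(
    name="hypothetical-agent-benchmark",
    process_based=True,
    realistic=True,
    public_sector_specific=False,
    public_sector_metrics=False,
)

print(example.meets_all())
print(example.unmet_criteria())
```

Treating the criteria as a conjunction mirrors the abstract's finding: a benchmark that satisfies three of the four still does not qualify.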

