CL HC May 27

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Pouya Sadeghi, Anamaria Crisan, Jimmy Lin

arXiv:2605.28966 75.4 h-index: 4

AI Analysis

This paper identifies a gap between leaderboard design and actual researcher practice, offering design recommendations to improve evaluation infrastructure for the CS community.

Through interviews with eight CS researchers, the study reveals a paradox of 'pragmatic skepticism': researchers distrust LLM leaderboards but still use them as rough decision aids. Peer networks, not leaderboards, are the primary model selection mechanism, and arena-based (human-voting) leaderboards are preferred over static benchmark leaderboards. Other key findings include demand for cost transparency (7 of 8 participants) and leaderboard influence that varies sharply across subfields.

Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.
