AI CL GT · Mar 24, 2025

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski

arXiv:2503.18825v2 · 6 citations · h-index: 14

Originality: Synthesis-oriented

AI Analysis

This work provides incremental benchmarks and litmus tests for researchers and developers to evaluate LLM agents in economic settings, addressing the need for standardized assessment in unknown environments.

The authors address the problem of evaluating LLM agents in unknown environments in two ways. First, they develop benchmarks based on economic decision-making tasks with scalable difficulty. Second, they introduce litmus tests that quantify agent tendencies in tradeoff scenarios. The result is a set of new assessment tools for complex economic applications such as procurement and pricing.

We develop benchmarks for LLM agents that act in, learn from, and strategize in unknown environments, the specifications of which the LLM agent must learn over time from deliberate exploration. Our benchmarks consist of decision-making tasks derived from key problems in economics. To forestall saturation, the benchmark tasks are synthetically generated with scalable difficulty levels. Additionally, we propose litmus tests, a new kind of quantitative measure for LLMs and LLM agents. Unlike benchmarks, litmus tests quantify differences in character, values, and tendencies of LLMs and LLM agents, by considering their behavior when faced with tradeoffs (e.g., efficiency versus equality) where there is no objectively right or wrong behavior. Overall, our benchmarks and litmus tests assess the abilities and tendencies of LLM agents in tackling complex economic problems in diverse settings spanning procurement, scheduling, task allocation, and pricing -- applications that should grow in importance as such agents are further integrated into the economy.
