CL GN Feb 14, 2024

STEER: Assessing the Economic Rationality of Large Language Models

Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz

arXiv:2402.09552v212.226 citations h-index: 21 ICML

Originality: Incremental advance

AI Analysis

The paper addresses the need for a reliable methodology to evaluate LLM agents used for decision-making. The advance is incremental: it builds on the existing economic literature to create a new assessment tool.

The paper tackles the problem of assessing the economic rationality of large language models (LLMs) as decision-making agents. It proposes a benchmark distribution that scores LLMs on fine-grained elements of rational behavior and produces a 'STEER report card', and it reports empirical results from 14 LLMs that characterize current state-of-the-art performance and the impact of model size.

There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
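
As a rough illustration of how per-element scores and a user-provided rubric might combine into a report card, here is a minimal Python sketch. It is not the paper's scoring procedure: the weighted-average aggregation, the function name `report_card`, and the element names are all assumptions made up for this example.

```python
from typing import Dict


def report_card(element_scores: Dict[str, float],
                rubric: Dict[str, float]) -> Dict[str, object]:
    """Combine per-element scores with rubric weights.

    Hypothetical aggregation: a weighted average over the elements
    named in the rubric. The abstract does not specify the formula.
    """
    total_weight = sum(rubric.values())
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")

    missing = sorted(set(rubric) - set(element_scores))
    if missing:
        raise KeyError(f"no score for rubric elements: {missing}")

    # Weighted average of the scores the rubric cares about.
    overall = sum(rubric[name] * element_scores[name] for name in rubric) / total_weight
    return {
        "elements": {name: element_scores[name] for name in rubric},
        "overall": overall,
    }


# Illustrative element names and scores in [0, 1]; not taken from the paper.
scores = {"transitivity": 0.9, "expected_utility": 0.6, "backward_induction": 0.3}
# A rubric that weights expected-utility reasoning twice as heavily.
rubric = {"transitivity": 1.0, "expected_utility": 2.0, "backward_induction": 1.0}

card = report_card(scores, rubric)
print(card)
```

Under this assumed scheme, changing the rubric weights changes the overall grade without re-running the benchmark, which is one plausible reading of why the rubric is left to the user.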
