AI · May 14

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

arXiv:2605.14537 · 43.8
Predicted impact top 71% in AI · last 90 days
Originality Incremental advance
AI Analysis

Provides a unified test for multi-capability strategic reasoning in LLMs, revealing specific failure modes not captured by isolated benchmarks.

Cattle Trade is a multi-agent benchmark testing LLMs on strategic reasoning combining auctions, bargaining, bluffing, and resource allocation. Over 242 games, heuristic code agents outperformed most LLMs, which exhibited failures like overbidding and weak opponent adaptation.

We introduce \textsc{Cattle Trade}, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.
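
The abstract notes that every bid and TC offer is logged so that failure modes can be read off behavioural traces. A minimal sketch of what such a scan might look like follows. It is an illustration under stated assumptions, not the paper's analysis code: the `Event` record, its field names, and the three detection rules are all hypothetical. In particular, it assumes "overbidding" means bidding more than the agent's current money, "self-bidding" means bidding in an auction the agent itself is running, and "bankrupt TC initiation" means starting a trade challenge with zero money; the paper may define these differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One logged action. Field names are hypothetical, not the benchmark's schema."""
    turn: int
    agent: str
    kind: str             # "bid" or "tc_init" in this sketch
    amount: int = 0       # money offered (bids only)
    balance: int = 0      # agent's money before acting
    auctioneer: str = ""  # agent running the auction (bids only)


def flag_failures(events):
    """Count assumed failure modes per (agent, mode) pair from an event log."""
    counts = Counter()
    for e in events:
        if e.kind == "bid":
            if e.amount > e.balance:      # assumed meaning of "overbidding"
                counts[(e.agent, "overbid")] += 1
            if e.agent == e.auctioneer:   # assumed meaning of "self-bidding"
                counts[(e.agent, "self_bid")] += 1
        elif e.kind == "tc_init" and e.balance == 0:
            # assumed meaning of "bankrupt TC initiation"
            counts[(e.agent, "bankrupt_tc")] += 1
    return counts


# Made-up log for illustration only; these are not results from the paper.
log = [
    Event(turn=1, agent="A", kind="bid", amount=50, balance=30, auctioneer="B"),
    Event(turn=2, agent="B", kind="bid", amount=10, balance=40, auctioneer="B"),
    Event(turn=3, agent="A", kind="tc_init", balance=0),
    Event(turn=4, agent="C", kind="bid", amount=20, balance=90, auctioneer="A"),
]
result = flag_failures(log)
```

Keying the counter by (agent, mode) keeps per-agent tallies separate, which is the granularity one would need to compare LLM agents against the heuristic code agents.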

