AI · CL · Mar 18, 2024

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Peking U · Tencent
arXiv:2403.11807v7 · 69 citations · h-index: 26 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of LLMs' gaming and decision-making capabilities for AI researchers, though it appears incremental by extending existing game theory approaches to multi-agent settings.

The authors tackled the problem of evaluating Large Language Models' decision-making abilities in multi-agent environments by introducing GAMA-Bench, a new framework with eight game theory scenarios and a dynamic scoring scheme. Their results show that Gemini-1.5-Pro outperformed other models with a score of 69.8 out of 100, while GPT-3.5 demonstrated strong robustness but limited generalizability.

Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($γ$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $γ$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $69.8$ out of $100$, followed by LLaMA-3.1-70B ($65.9$) and Mixtral-8x22B ($62.4$). Our code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench.
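
To make the idea of a score that adapts to game parameters concrete, here is a minimal sketch built on the well-known "guess 2/3 of the average" game. The abstract does not list the eight scenarios or give the scoring formula, so the choice of game, the function names `play_round` and `normalized_score`, and the normalization itself are illustrative assumptions, not the benchmark's actual implementation.

```python
from statistics import mean


def play_round(guesses, ratio=2 / 3):
    """Play one round of 'guess ratio * average' among any number of players.

    Returns the target value and the index of the player closest to it.
    """
    target = ratio * mean(guesses)
    winner = min(range(len(guesses)), key=lambda i: abs(guesses[i] - target))
    return target, winner


def normalized_score(guesses, max_value=100):
    """Hypothetical 0-100 score (an assumption, not the paper's formula).

    For ratio < 1 the Nash equilibrium is for everyone to guess 0, so the
    score is 100 when all guesses are 0 and 0 when all equal max_value.
    Dividing by max_value keeps scores comparable when the range changes.
    """
    return 100 * (1 - mean(guesses) / max_value)


if __name__ == "__main__":
    guesses = [30, 60, 90]
    target, winner = play_round(guesses)
    print(f"target={target:.2f}, winner=player {winner}")
    print(f"score={normalized_score(guesses):.1f}")
```

Because the score is divided by `max_value`, the same relative behaviour earns the same score whether the guessing range is 0 to 100 or 0 to 10, which is one simple way a scoring rule could stay comparable across game settings.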

Code Implementations: 1 repo
