AI | CL | Dec 16, 2024

Codenames as a Benchmark for Large Language Models

arXiv:2412.11373v2 | 2 citations | h-index: 2 | IEEE Trans Game
Originality: Incremental advance
AI Analysis

This paper provides a new benchmark for assessing AI reasoning in language-based games, but the advance is incremental because it builds on existing LLM evaluations.

The paper tackles the problem of evaluating reasoning capabilities in Large Language Models (LLMs) by proposing Codenames as a benchmark. It finds that LLMs vary in overall performance and show different emergent behaviors during gameplay, and that LLM agents generalize to a wider range of teammates than prior techniques.

In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still struggle with lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.

