How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment
The paper addresses a challenge relevant to AI researchers, namely assessing the reasoning capabilities of LLMs, but its contribution is incremental: it applies existing methods to a new testbed.
The paper evaluates multi-step deductive reasoning in LLMs using a text-based version of Clue. It finds that agents achieved only four correct wins in 18 games (about 22%) and that fine-tuning did not reliably improve performance.
Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
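To make concrete what "multi-step deductive reasoning" means in Clue, the following is a minimal sketch of the bookkeeping an idealized player could do. It is not the paper's implementation. The `DeductionSheet` class, its method names, the player labels, and the reduced nine-card deck are all hypothetical choices made for illustration. The sketch chains two observations (a player who passed on a suggestion holds none of its cards; a player who refuted a suggestion holds at least one of them) to infer a card nobody showed directly.

```python
# Hypothetical reduced deck for illustration (the real game has more cards).
DECK = {
    "suspect": {"Scarlett", "Mustard", "Plum"},
    "weapon": {"Rope", "Wrench", "Dagger"},
    "room": {"Hall", "Study", "Kitchen"},
}


class DeductionSheet:
    """Tracks which cards can still be in the envelope (illustrative helper)."""

    def __init__(self, deck, me, own_cards):
        self.deck = {category: set(cards) for category, cards in deck.items()}
        self.holder = {card: me for card in own_cards}  # card -> known owner
        self.lacks = {}    # player -> cards that player cannot hold
        self.clauses = []  # (player, cards): player holds at least one of cards

    def saw_card(self, player, card):
        """A player showed us a card directly."""
        self.holder[card] = player
        self._propagate()

    def passed(self, player, suggestion):
        """A player could not refute a suggestion, so holds none of its cards."""
        self.lacks.setdefault(player, set()).update(suggestion)
        self._propagate()

    def refuted_unseen(self, player, suggestion):
        """A player refuted someone else's suggestion; we did not see the card."""
        self.clauses.append((player, set(suggestion)))
        self._propagate()

    def _propagate(self):
        """Repeat until no clause yields a new owner (the multi-step part)."""
        changed = True
        while changed:
            changed = False
            for player, cards in self.clauses:
                possible = {
                    c for c in cards
                    if c not in self.lacks.get(player, set())
                    and self.holder.get(c, player) == player
                }
                if len(possible) == 1:
                    (card,) = possible
                    if card not in self.holder:
                        self.holder[card] = player
                        changed = True

    def candidates(self):
        """Cards per category not yet known to be in any player's hand."""
        return {cat: cards - set(self.holder) for cat, cards in self.deck.items()}

    def solution(self):
        """Return the envelope contents once each category has one candidate."""
        cands = self.candidates()
        if all(len(c) == 1 for c in cands.values()):
            return {cat: next(iter(c)) for cat, c in cands.items()}
        return None


# Player "A" holds Scarlett and Rope; "B" and "C" are opponents.
sheet = DeductionSheet(DECK, me="A", own_cards={"Scarlett", "Rope"})
sheet.passed("C", {"Mustard", "Dagger", "Hall"})          # C holds none of these
sheet.refuted_unseen("C", {"Mustard", "Wrench", "Hall"})  # so C must hold Wrench
sheet.saw_card("B", "Mustard")
sheet.saw_card("B", "Hall")
sheet.saw_card("C", "Kitchen")
print(sheet.solution())
```

In this sketch the Wrench is never shown to player "A"; its owner follows only from combining the earlier pass with the later refutation. This is the kind of inference an agent has to keep consistent across many turns of a full game.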