AI · CL · Mar 17

How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

arXiv:2603.17169 · 61.6 · h-index: 2
AI Analysis

This paper addresses a challenge relevant to AI researchers, namely assessing the reasoning capabilities of LLMs, but the contribution is incremental because it applies existing methods to a new testbed.

The paper evaluates multi-step deductive reasoning in LLMs using a text-based version of Clue. Agents achieved only four correct wins across 18 games (about 22%), and fine-tuning did not reliably improve performance.

Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
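To make concrete the kind of multi-step deduction a Clue game demands, here is a minimal sketch of a candidate-elimination tracker. It is not the paper's implementation: the class `DeductionTracker`, its method names, and the card lists (taken from the standard board game) are assumptions for illustration, and the paper's text-based variant may use different rules or card sets.

```python
from itertools import product

# Card names from the classic board game; the paper's variant may differ (assumption).
SUSPECTS = ["Scarlett", "Mustard", "White", "Green", "Peacock", "Plum"]
WEAPONS = ["Candlestick", "Dagger", "Lead Pipe", "Revolver", "Rope", "Wrench"]
ROOMS = ["Kitchen", "Ballroom", "Conservatory", "Dining Room", "Billiard Room",
         "Library", "Lounge", "Hall", "Study"]


class DeductionTracker:
    """Hypothetical helper: tracks every (suspect, weapon, room) solution still possible."""

    def __init__(self):
        # Start with all 6 * 6 * 9 = 324 possible solutions.
        self.candidates = set(product(SUSPECTS, WEAPONS, ROOMS))

    def card_seen(self, card):
        # A card in our hand, or shown to us, cannot be part of the solution.
        self.candidates = {c for c in self.candidates if card not in c}

    def nobody_refuted(self, suggestion, own_hand):
        # Under standard rules, if no opponent can refute our suggestion,
        # each suggested card we do not hold must be part of the solution.
        for idx, card in enumerate(suggestion):
            if card not in own_hand:
                self.candidates = {c for c in self.candidates if c[idx] == card}

    def solved(self):
        # Exactly one remaining candidate means an accusation is safe.
        return len(self.candidates) == 1


tracker = DeductionTracker()
own_hand = {"Dagger"}
for card in own_hand:
    tracker.card_seen(card)

# Nobody refutes our suggestion, so Plum and Study are fixed.
tracker.nobody_refuted(("Plum", "Dagger", "Study"), own_hand)

# Opponents later show us four of the remaining weapons.
for card in ["Candlestick", "Lead Pipe", "Revolver", "Wrench"]:
    tracker.card_seen(card)

print(tracker.solved(), tracker.candidates)
```

Each update is trivial in isolation. The difficulty the abstract points to is applying such updates consistently over a full game, where one missed or misapplied elimination can lead to a wrong accusation.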
