AI | Jun 23, 2025

Baba is LLM: Reasoning in a Game with Dynamic Rules

arXiv:2506.19095v1 | 1 citation | h-index: 28
Originality: Synthesis-oriented
AI Analysis

For AI researchers, this paper addresses the reasoning limitations of LLMs in dynamic rule-based environments. The contribution is incremental: it applies existing methods to a new game domain.

The paper investigates how well large language models (LLMs) can play the puzzle game Baba is You, in which players change the rules by rearranging text blocks. Larger models such as GPT-4o perform better at reasoning and puzzle solving, while smaller or unadapted models struggle with game mechanics and rule changes. Finetuning improves level analysis but not solution formulation.

Large language models (LLMs) are known to perform well on language tasks, but struggle with reasoning tasks. This paper explores the ability of LLMs to play the 2D puzzle game Baba is You, in which players manipulate rules by rearranging text blocks that define object properties. Given that this rule-manipulation relies on language abilities and reasoning, it is a compelling challenge for LLMs. Six LLMs are evaluated using different prompt types, including (1) simple, (2) rule-extended and (3) action-extended prompts. In addition, two models (Mistral, OLMo) are finetuned using textual and structural data from the game. Results show that while larger models (particularly GPT-4o) perform better in reasoning and puzzle solving, smaller unadapted models struggle to recognize game mechanics or apply rule changes. Finetuning improves the ability to analyze the game levels, but does not significantly improve solution formulation. We conclude that even for state-of-the-art and finetuned LLMs, reasoning about dynamic rule changes is difficult (specifically, understanding the use-mention distinction). The results provide insights into the applicability of LLMs to complex problem-solving tasks and highlight the suitability of games with dynamically changing rules for testing reasoning and reflection by LLMs.
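To make the rule-manipulation mechanic concrete, here is a minimal sketch of how active rules could be read off a grid of text blocks, and how moving one block changes them. It is an illustration only, not the paper's code or the game's implementation: the vocabulary, the `extract_rules` helper, and the grid encoding are all assumptions, and only simple NOUN IS PROPERTY triples are handled.

```python
from typing import List, Set, Tuple

# Hypothetical vocabulary for this sketch; the real game has many more words.
NOUNS = {"BABA", "ROCK", "WALL", "FLAG"}
PROPERTIES = {"YOU", "WIN", "PUSH", "STOP"}

Grid = List[List[str]]  # each cell holds one text block, or "" if empty


def extract_rules(grid: Grid) -> Set[Tuple[str, str]]:
    """Collect (noun, property) pairs from every NOUN IS PROPERTY triple
    that reads left-to-right in a row or top-to-bottom in a column."""
    rules: Set[Tuple[str, str]] = set()
    rows = [list(row) for row in grid]
    cols = [list(col) for col in zip(*grid)]  # assumes a rectangular grid
    for line in rows + cols:
        for i in range(len(line) - 2):
            first, middle, last = line[i:i + 3]
            if first in NOUNS and middle == "IS" and last in PROPERTIES:
                rules.add((first, last))
    return rules


level = [
    ["BABA", "IS", "YOU", ""],
    ["", "", "", ""],
    ["FLAG", "IS", "WIN", ""],
    ["WALL", "IS", "STOP", ""],
]
before = extract_rules(level)

# Push the STOP block one cell to the right, which breaks "WALL IS STOP".
level[3][2], level[3][3] = "", "STOP"
after = extract_rules(level)

print(sorted(before))
print(sorted(after))
```

The word block WALL (a mention) is distinct from any wall object on the map (a use), and the sketch tracks only the word blocks. A solver has to reason about both layers at once, which is the use-mention difficulty the abstract highlights.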

