LG · AI · Apr 6

Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

arXiv:2604.05134 · 89.1 · h-index: 25 · Has Code
Predicted impact: top 8% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving language model reasoning in chess and offers insights into training methods that could apply to other reasoning tasks. It is an incremental advance, since it builds on existing fine-tuning and RL techniques.

The study examines how language model reasoning in chess develops from supervised fine-tuning (SFT) through reinforcement learning (RL). Fine-tuning to predict the best move directly gave the strongest downstream performance but led to unfaithful reasoning after RL, while fine-tuning on multi-move trajectories achieved comparable performance with faithful reasoning and more stable RL. RL also improved move quality and reduced hallucination rates, and the resulting 7B-parameter model surpassed leading open-source reasoning models in chess.

How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.

