Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
For researchers evaluating chess-trained LLMs, this work reveals that reported benchmark gains may be due to memorization rather than reasoning, and proposes a verifier-in-the-loop approach as a more flexible and efficient solution.
The paper shows that chess-trained language models like KinGPT (25M parameters) achieve high benchmark scores primarily through pattern-matching, not genuine understanding. It also demonstrates that an LLM-Modulo framework with a verifier improves RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move validity from 19.3% to 95.3%, offering a cost-effective alternative to fine-tuning on synthetic data.
Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model that sees only (position, best-move) pairs, and show that it exceeds the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in the existing literature about chess-trained language models and argue that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate that LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move-generation validity from 19.3% to 95.3% on mate-in-N chess puzzles. These gains are comparable to those achieved by ChessGPT's fine-tuning on chess-specific web corpora, at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to training directly on synthetic data in well-defined domains. We open-source all training and evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
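To make the verifier-in-the-loop idea concrete, the following is a minimal sketch of a generate-and-verify loop, not the paper's implementation. All names (`llm_modulo`, `make_scripted_proposer`, `toy_verify`) and the toy move sets are hypothetical: the proposer is a scripted stand-in for an LLM call, and the verifier is a stand-in for a real chess rules engine that would check legality and mate.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical interfaces: a proposer maps accumulated critiques to a move
# string; a verifier maps a move string to (accepted, critique).
Proposer = Callable[[List[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def llm_modulo(propose: Proposer, verify: Verifier,
               max_rounds: int = 5) -> Optional[str]:
    """Ask for candidates until the verifier accepts one or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        accepted, critique = verify(candidate)
        if accepted:
            return candidate
        # Rejected: record the critique so the next proposal can use it.
        feedback.append(f"{candidate}: {critique}")
    return None  # no verified move within the round budget


def make_scripted_proposer(guesses: Iterable[str]) -> Proposer:
    """Stand-in for an LLM: replays fixed guesses and ignores the feedback.
    A real proposer would fold the feedback into its next prompt."""
    remaining = iter(guesses)

    def propose(feedback: List[str]) -> str:
        return next(remaining)

    return propose


# Toy data standing in for a rules engine on one imagined position.
TOY_LEGAL_MOVES = {"Qh5", "Nf3", "Qxf7#"}
TOY_MATING_MOVES = {"Qxf7#"}


def toy_verify(move: str) -> Tuple[bool, str]:
    """Check validity first, then whether the move solves the puzzle."""
    if move not in TOY_LEGAL_MOVES:
        return False, "illegal move"
    if move not in TOY_MATING_MOVES:
        return False, "legal, but does not deliver mate"
    return True, "ok"


GUESSES = ["Ke9", "Qh5", "Qxf7#"]  # invalid, then legal, then correct
result = llm_modulo(make_scripted_proposer(GUESSES), toy_verify)
print(result)  # prints Qxf7#
```

In this sketch the language model itself is never retrained. Correctness comes from the external check, which is why the same loop could in principle be reused for another well-defined domain by swapping only the verifier.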