CL Feb 24, 2025

AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin

arXiv:2502.16906v1 22.617 citations h-index: 17 Has Code

Originality: Incremental advance

AI Analysis

AutoLogi gives researchers and developers a more reliable tool for assessing LLMs' reasoning capabilities, though the advance is incremental because it builds on existing benchmark methods.

The authors address the overestimation of LLM reasoning ability caused by multiple-choice benchmarks, which are vulnerable to random guessing. They develop AutoLogi, an automated method for generating open-ended logic puzzles. Across eight models, scores on AutoLogi range from 35% to 73%, compared with 21% to 37% on the source multiple-choice dataset.

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
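The abstract mentions program-based verification and rejection sampling with program verifiers. The sketch below illustrates the general idea only; it is not the authors' implementation. The seating puzzle, its constraints, and the names `verify_format`, `verify_constraints`, `verify`, and `rejection_sample` are all hypothetical, invented for illustration. An open-ended puzzle can have several valid answers, so the answer is checked by running code against it rather than by comparing it to a single answer key.

```python
import itertools

# Hypothetical puzzle (invented for illustration): seat three people in seats 1..3.
PEOPLE = ["Alice", "Bob", "Carol"]


def verify_format(answer):
    """Return True if the answer maps each person to a distinct seat in 1..3."""
    return (
        isinstance(answer, dict)
        and set(answer) == set(PEOPLE)
        and all(isinstance(seat, int) for seat in answer.values())
        and sorted(answer.values()) == [1, 2, 3]
    )


def verify_constraints(answer):
    """Return True if the answer satisfies the puzzle's (made-up) logical constraints."""
    return (
        answer["Alice"] < answer["Bob"]  # Alice sits somewhere to the left of Bob
        and answer["Carol"] != 2         # Carol does not sit in the middle seat
    )


def verify(answer):
    """Program-based check: the format must be valid before constraints are tested."""
    return verify_format(answer) and verify_constraints(answer)


def rejection_sample(candidates):
    """Keep only candidate answers that pass the verifier (assumed filtering step)."""
    return [candidate for candidate in candidates if verify(candidate)]


if __name__ == "__main__":
    # Stand-in for model outputs: every possible seating arrangement.
    candidates = [
        dict(zip(PEOPLE, seats)) for seats in itertools.permutations([1, 2, 3])
    ]
    for kept in rejection_sample(candidates):
        print(kept)
```

Of the six possible seatings in this toy puzzle, two pass the verifier, which shows why a fixed answer key would not suffice. In a training-data setting, the candidates would be sampled model responses, and only verified ones would be kept.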
