LG GR | May 5

Probing Structural Mathematical Reasoning in Language Models with Algebraic Trapdoors

arXiv:2605.04352 | 8.2 | h-index: 1

AI Analysis

For researchers evaluating reasoning capabilities of large language models, this benchmark exposes a four-way classification of model behavior that standard answer-key scoring conflates.

The paper introduces a benchmark for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) whose efficient solution requires algebraic priors. In the headline result, one model reasoned for 152 minutes on the index variant and then abstained rather than commit to its computed candidate, because the answer hinged on a membership query in SL(3, Z) of unknown decidability; the authors read this as calibrated meta-cognition.

We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a finitely generated subgroup as a list of integer matrices and asks for an arithmetic invariant -- index, surjection-at-prime, or membership -- that the construction-time information (N, K) pins down in O(1) closed form, but that the solver, lacking that information, must derive either by Aschbacher-classification analysis or by a membership query in SL(3, Z) of unknown decidability. The benchmark therefore distinguishes models with internalized algebraic priors (Aschbacher classes, McLaughlin's theorem, Property (T), the congruence subgroup property) from models that rely on general-purpose computation. We report empirical results across five representative reasoning traces from two state-of-the-art models. The headline result: on the index variant, one model spent 152 minutes of reasoning, explicitly identified the kernel-side membership question as the bottleneck, attempted constructive verification, and abstained with "DON'T KNOW" rather than commit to its computed cokernel candidate -- demonstrating calibrated meta-cognition on the open-decidability boundary that the benchmark was designed to probe. We argue that the benchmark exposes a four-way classification of model behavior (commit-correct, commit-wrong, abstain-correct, abstain-wrong) that standard answer-key scoring conflates.
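To make the "surjection-at-prime" invariant concrete, the following is a minimal brute-force sketch, not the paper's method. It assumes the invariant asks whether the subgroup generated by the given integer matrices maps onto SL(3, F_p) under reduction mod p; that reading, the function names, and the example generators are all illustrative assumptions. The enumeration is only feasible for very small p, which is the kind of general-purpose computation the abstract contrasts with structural reasoning.

```python
# Illustrative sketch only: brute-force check of surjectivity onto SL(3, F_p).
# All names and the example generators are hypothetical, not from the paper.

def det3(m):
    """Determinant of a 3x3 integer matrix given as nested tuples or lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def mat_mul_mod(a, b, p):
    """Product of two 3x3 matrices with entries reduced mod p."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) % p for j in range(3))
        for i in range(3)
    )

def sl3_order(p):
    """Order of SL(3, F_p), i.e. |GL(3, F_p)| divided by (p - 1)."""
    return (p**3 - 1) * (p**3 - p) * (p**3 - p**2) // (p - 1)

def image_mod_p(gens, p):
    """Enumerate the image of the generated subgroup in SL(3, F_p).

    The image is finite, so closing the identity under right multiplication
    by the reduced generators already yields the whole subgroup.
    """
    identity = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
    reduced = [tuple(tuple(x % p for x in row) for row in g) for g in gens]
    seen = {identity}
    frontier = [identity]
    while frontier:
        next_frontier = []
        for m in frontier:
            for g in reduced:
                c = mat_mul_mod(m, g, p)
                if c not in seen:
                    seen.add(c)
                    next_frontier.append(c)
        frontier = next_frontier
    return seen

def surjects_at_prime(gens, p):
    """True if the generated subgroup reduces onto all of SL(3, F_p)."""
    if not all(det3(g) == 1 for g in gens):
        raise ValueError("generators must have determinant 1")
    return len(image_mod_p(gens, p)) == sl3_order(p)

# Example generators (assumed for illustration): an elementary matrix and a
# 3-cycle permutation matrix, which together generate SL(3, Z).
E12 = ((1, 1, 0), (0, 1, 0), (0, 0, 1))
CYC = ((0, 1, 0), (0, 0, 1), (1, 0, 0))

if __name__ == "__main__":
    print(surjects_at_prime([E12, CYC], 2))  # True
    print(surjects_at_prime([E12], 2))       # False
```

The cost of this enumeration grows with |SL(3, F_p)|, roughly p^8, and it says nothing about index or membership in SL(3, Z) itself, where no comparable finite enumeration is available.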
