Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
For AI researchers evaluating LLM reasoning capabilities, this benchmark provides a quantitative metric to distinguish genuine logical reasoning from semantic pattern matching.
The paper introduces the Obfuscated Natural Number Game to test whether LLMs' success on formal math benchmarks stems from genuine reasoning or pattern matching. Results show that reasoning models maintain accuracy under obfuscation while general models degrade, with a universal latency tax observed.
While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning: the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as the necessary ability for future automated theorem discovery AI. We use the Obfuscated Natural Number Game, a benchmark to evaluate Architectural Reasoning. By renaming identifiers in the Natural Number Game in Lean 4, we created a zero-knowledge, closed environment. We evaluate state-of-the-art models, finding a universal latency tax where obfuscation increases inference time. The results also reveal a divergence in robustness: while general models (Claude-Sonnet-4.5, GPT-4o) suffer performance degradation, reasoning models (DeepSeek-R1, GPT-5, DeepSeek-Prover-V2) maintain the same accuracy despite the absence of semantic cues. These findings provide a quantitative metric for assessing the true capacity for mathematical reasoning.