On Compositional Learning Behaviours in Formal Mathematics
For researchers in automated theorem proving, this work identifies a key cognitive skill, compositional learning behaviours (CLBs), that is required for top-tier performance, but also shows that this skill alone is not sufficient, highlighting a bottleneck for advancing AI in formal mathematics.
The paper introduces S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that measures compositional learning behaviours (CLBs) in formal mathematics, and finds that CLB competency is necessary but not sufficient for high performance on Olympiad-level theorem proving (miniF2F >75%), a result supported by exact permutation tests (p=0.004).
Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose \textbf{S2B-LM}, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit, rather than merely probe, latent CLB competency. We cross-evaluate ten Lean~4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, and use exact permutation tests to establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $>75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, we conclude that CLB competency is \emph{necessary but not sufficient} for the hard tail of formal mathematical verification.
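To make the necessity claim concrete, the following is a minimal sketch of one way an exact permutation test of this kind could be set up. It is not taken from the paper: the null hypothesis (tier membership assigned to models uniformly at random), the test statistic (all tier members fall among the top CLB scorers), the function name `exact_containment_p_value`, and in particular the tier size are assumptions made for illustration. The abstract states only that there are ten provers and that the tier members are among the five highest CLB scorers.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def exact_containment_p_value(n_models: int, top_size: int, tier_size: int) -> Fraction:
    """Exact probability that a random tier lies entirely inside the top set.

    Hypothetical null: the `tier_size` models in the Olympiad-level tier are
    drawn uniformly at random from `n_models`, independently of CLB rank.
    """
    if not 0 <= tier_size <= top_size <= n_models:
        raise ValueError("require 0 <= tier_size <= top_size <= n_models")

    # Indices 0..top_size-1 stand for the highest CLB scorers.
    top_set = set(range(top_size))

    total = 0
    hits = 0
    # Enumerate every possible tier assignment (exact, not Monte Carlo).
    for tier in combinations(range(n_models), tier_size):
        total += 1
        if top_set.issuperset(tier):
            hits += 1
    return Fraction(hits, total)


if __name__ == "__main__":
    # tier_size=5 is an illustrative assumption, not a figure from the paper.
    p = exact_containment_p_value(n_models=10, top_size=5, tier_size=5)
    # The enumeration agrees with the closed form C(top, k) / C(n, k).
    assert p == Fraction(comb(5, 5), comb(10, 5))
    print(p, float(p))
```

Under these assumptions the enumeration reduces to the closed form $\binom{5}{k} / \binom{10}{k}$ for a tier of size $k$; with the illustrative choice $k=5$ this is $1/252 \approx 0.004$. The actual tier size and test statistic used in the paper may differ.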