LG · Apr 25

Surface Sensitivity in Lean 4 Autoformalization

arXiv:2604.23135

Predicted impact: top 23% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers working on autoformalization in Lean, the paper clarifies that the main bottleneck is compilation success rather than semantic understanding, guiding future training interventions.

The paper investigates why semantically equivalent paraphrases of theorem statements in Lean 4 produce different formal outputs, finding that sensitivity arises from compilation failures rather than semantic divergence; when both outputs compile, they are semantically equivalent and structurally near-identical.

Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.
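The abstract's closing recommendation, reporting compile-conditional equivalence separately from compile-level consistency, can be illustrated with a small sketch. This is not the paper's evaluation code: the `PairResult` record, both function names, and the toy data are hypothetical, and the `equivalent` flag simply stands in for whatever verdict an equivalence checker such as BEq+ would return.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PairResult:
    """Outcome for one (original, paraphrased) statement pair (hypothetical schema)."""
    baseline_compiles: bool
    perturbed_compiles: bool
    # Equivalence verdict from an external checker; only meaningful when both compile.
    equivalent: Optional[bool] = None


def compile_flip_rate(pairs: Sequence[PairResult]) -> float:
    """Fraction of pairs whose compile status differs between baseline and paraphrase."""
    if not pairs:
        return 0.0
    flips = sum(p.baseline_compiles != p.perturbed_compiles for p in pairs)
    return flips / len(pairs)


def compile_conditional_equivalence(pairs: Sequence[PairResult]) -> Optional[float]:
    """Equivalence rate restricted to pairs where both outputs compile.

    Returns None when no pair has both outputs compiling, so the metric is
    never silently reported as 0 or 1 on an empty denominator.
    """
    both = [p for p in pairs if p.baseline_compiles and p.perturbed_compiles]
    if not both:
        return None
    return sum(bool(p.equivalent) for p in both) / len(both)


# Toy data invented for illustration only; these are not results from the paper.
toy = [
    PairResult(True, True, True),
    PairResult(True, True, True),
    PairResult(True, False),
    PairResult(False, False),
]

print(compile_flip_rate(toy))                # 0.25
print(compile_conditional_equivalence(toy))  # 1.0
```

Keeping the two numbers apart makes the abstract's distinction visible: a model can score perfectly on equivalence among compiling pairs while still flipping compile status under paraphrase.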
