Steering LLMs for Formal Theorem Proving
This work addresses the challenge of precise interpretation in automated theorem proving for researchers and practitioners, though it is incremental as it builds on existing methods.
The paper tackles the problem of improving formal theorem proving with LLMs by addressing ambiguous informal cues, using activation steering to adjust residual activations for better proof construction without fine-tuning, resulting in performance improvements under two decoding strategies.
Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.