Michael Tautschnig

h-index23

7papers

1,967citations

Novelty55%

AI Score56

Ranked #7,322 of 194,257 authors (top 4%)#3 in LO (top 1%)

7 Papers

15.3LOMay 1Code

Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?

Romy Peled, Daniel Kroening, Michael Tautschnig et al.

Large Language Models (LLMs) have shown potential for solving mathematical tasks. We show that LLMs can be utilized to generate proofs by induction for hardware verification and thereby replace some of the manual work done by Formal Verification engineers and deliver industrial value. We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool. Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs. For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.

18.5SEMay 7Code

Teaching LLMs Program Semantics via Symbolic Execution Traces

Jonas Bayer, Stefan Zetzsche, Olivier Bouissou et al.

We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just ${\sim}$3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4$\times$ larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.

14.7CLNov 12, 2025

A Neurosymbolic Approach to Natural Language Formalization and Verification

Sam Bayless, Stefano Buliani, Darion Cassel et al.

Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.

9.6LOJun 16

Verifying the Rust Standard Library

Byron Cook, Remi Delmas, Zyad Hassan et al.

Rust's type system prevents many classes of memory errors, yet its standard library relies heavily on unsafe code whose correctness is validated through testing, including dynamic checks under Miri, but lacks static verification. We present what is, to the best of our knowledge, the largest verification campaign reported for a software library: an open, crowdsourced effort that integrates complementary verification tools into the continuous integration of a verification repository forked from the Rust standard library. We analyze the campaign's effectiveness, discuss the practical value of machine-checked proofs for a subset of undefined behaviors (e.g., out-of-bounds access, null and dangling pointer dereferences, and use of uninitialized memory), and frame the remaining obstacles as open challenges for the formal-methods community.

9.7LOOct 31, 2024Code

Neural Model Checking

Mirco Giacobbe, Daniel Kroening, Abhinandan Pal et al.

We introduce a machine learning approach to model checking temporal logic, with application to formal hardware verification. Model checking answers the question of whether every execution of a given system satisfies a desired temporal logic specification. Unlike testing, model checking provides formal guarantees. Its application is expected standard in silicon design and the EDA industry has invested decades into the development of performant symbolic model checking algorithms. Our new approach combines machine learning and symbolic reasoning by using neural networks as formal proof certificates for linear temporal logic. We train our neural certificates from randomly generated executions of the system and we then symbolically check their validity using satisfiability solving which, upon the affirmative answer, establishes that the system provably satisfies the specification. We leverage the expressive power of neural networks to represent proof certificates as well as the fact that checking a certificate is much simpler than finding one. As a result, our machine learning procedure for model checking is entirely unsupervised, formally sound, and practically effective. We experimentally demonstrate that our method outperforms the state-of-the-art academic and commercial model checkers on a set of standard hardware designs written in SystemVerilog.

3.8SESep 15, 2015

Assisted Coverage Closure

Adam Nellis, Pascal Kesseli, Philippa Ryan Conmy et al.

The malfunction of safety-critical systems may cause damage to people and the environment. Software within those systems is rigorously designed and verified according to domain specific guidance, such as ISO26262 for automotive safety. This paper describes academic and industrial co-operation in tool development to support one of the most stringent of the requirements --- achieving full code coverage in requirements-driven testing. We present a verification workflow supported by a tool that integrates the coverage measurement tool RapiCover with the test-vector generator FShell. The tool assists closing the coverage gap by providing the engineer with test vectors that help in debugging coverage-related code quality issues and creating new test cases, as well as justifying the presence of unreachable parts of the code in order to finally achieve full effective coverage according to the required criteria. To illustrate the practical utility of the tool, we report about an application of the tool to a case study from automotive industry.

5.9LOJul 30, 2012

Software Verification for Weak Memory via Program Transformation

Jade Alglave, Daniel Kroening, Vincent Nimal et al.

Despite multiprocessors implementing weak memory models, verification methods often assume Sequential Consistency (SC), thus may miss bugs due to weak memory. We propose a sound transformation of the program to verify, enabling SC tools to perform verification w.r.t. weak memory. We present experiments for a broad variety of models (from x86/TSO to Power/ARM) and a vast range of verification tools, quantify the additional cost of the transformation and highlight the cases when we can drastically reduce it. Our benchmarks include work-queue management code from PostgreSQL.