DBMar 4Code
SpotIt+: Verification-based Text-to-SQL Evaluation with Database ConstraintsRocky Klopfenstein, Yang He, Andrew Tremante et al.
We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.
LGJun 8, 2022
Toward Certified Robustness Against Real-World Distribution ShiftsHaoze Wu, Teruhiro Tagomori, Alexander Robey et al.
We consider the problem of certifying the robustness of deep neural networks against real-world distribution shifts. To do so, we bridge the gap between hand-crafted specifications and realistic deployment settings by proposing a novel neural-symbolic verification framework, in which we train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations, which are fundamental to many state-of-the-art generative models. To address this challenge, we propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement. The key idea is to "lazily" refine the abstraction of sigmoid functions to exclude spurious counter-examples found in the previous abstraction, thus guaranteeing progress in the verification process while keeping the state-space small. Experiments on the MNIST and CIFAR-10 datasets show that our framework significantly outperforms existing methods on a range of challenging distribution shifts.
FLOct 7, 2023
Lemur: Integrating Large Language Models in Automated Program VerificationHaoze Wu, Clark Barrett, Nina Narodytska
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.
AIMar 7, 2022
Scalable Verification of GNN-based Job SchedulersHaoze Wu, Clark Barrett, Mahmood Sharif et al.
Recently, Graph Neural Networks (GNNs) have been applied for scheduling jobs over clusters, achieving better performance than hand-crafted heuristics. Despite their impressive performance, concerns remain over whether these GNN-based job schedulers meet users' expectations about other important properties, such as strategy-proofness, sharing incentive, and stability. In this work, we consider formal verification of GNN-based job schedulers. We address several domain-specific challenges such as networks that are deeper and specifications that are richer than those encountered when verifying image and NLP classifiers. We develop vegas, the first general framework for verifying both single-step and multi-step properties of these schedulers based on carefully designed algorithms that combine abstractions, refinements, solvers, and proof transfer. Our experimental results show that vegas achieves significant speed-up when verifying important properties of a state-of-the-art GNN-based scheduler compared to previous methods.
LGMar 19, 2022
Efficient Neural Network Analysis with Sum-of-InfeasibilitiesHaoze Wu, Aleksandar Zeljić, Guy Katz et al.
Inspired by sum-of-infeasibilities methods in convex optimization, we propose a novel procedure for analyzing verification queries on neural networks with piecewise-linear activation functions. Given a convex relaxation which over-approximates the non-convex activation functions, we encode the violations of activation functions as a cost function and optimize it with respect to the convex relaxation. The cost function, referred to as the Sum-of-Infeasibilities (SoI), is designed so that its minimum is zero and achieved only if all the activation functions are satisfied. We propose a stochastic procedure, DeepSoI, to efficiently minimize the SoI. An extension to a canonical case-analysis-based complete search procedure can be achieved by replacing the convex procedure executed at each search state with DeepSoI. Extending the complete search with DeepSoI achieves multiple simultaneous goals: 1) it guides the search towards a counter-example; 2) it enables more informed branching decisions; and 3) it creates additional opportunities for bound derivation. An extensive evaluation across different benchmarks and solvers demonstrates the benefit of the proposed techniques. In particular, we demonstrate that SoI significantly improves the performance of an existing complete search procedure. Moreover, the SoI-based implementation outperforms other state-of-the-art complete verifiers. We also show that our technique can efficiently improve upon the perturbation bound derived by a recent adversarial attack algorithm.
LGAug 16, 2022
On Optimizing Back-Substitution Methods for Neural Network VerificationTom Zelazny, Haoze Wu, Clark Barrett et al.
With the increasing application of deep learning in mission-critical systems, there is a growing need to obtain formal guarantees about the behaviors of neural networks. Indeed, many approaches for verifying neural networks have been recently proposed, but these generally struggle with limited scalability or insufficient accuracy. A key component in many state-of-the-art verification schemes is computing lower and upper bounds on the values that neurons in the network can obtain for a specific input domain -- and the tighter these bounds, the more likely the verification is to succeed. Many common algorithms for computing these bounds are variations of the symbolic-bound propagation method; and among these, approaches that utilize a process called back-substitution are particularly successful. In this paper, we present an approach for making back-substitution produce tighter bounds. To achieve this, we formulate and then minimize the imprecision errors incurred during back-substitution. Our technique is general, in the sense that it can be integrated into numerous existing symbolic-bound propagation techniques, with only minor modifications. We implement our approach as a proof-of-concept tool, and present favorable results compared to state-of-the-art verifiers that perform back-substitution.
LGMar 3, 2023
Convex Bounds on the Softmax Function with Applications to Robustness VerificationDennis Wei, Haoze Wu, Min Wu et al.
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well. This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models. We derive bounds using both a natural exponential-reciprocal decomposition of the softmax as well as an alternative decomposition in terms of the log-sum-exp function. The new bounds are provably and/or numerically tighter than linear bounds obtained in previous work on robustness verification of transformers. As illustrations of the utility of the bounds, we apply them to verification of transformers as well as of the robustness of predictive uncertainty estimates of deep ensembles.
LGDec 2, 2022
VeriX: Towards Verified Explainability of Deep Neural NetworksMin Wu, Haoze Wu, Clark Barrett
We present VeriX (Verified eXplainability), a system for producing optimal robust explanations and generating counterfactuals along decision boundaries of machine learning models. We build such explanations and counterfactuals iteratively using constraint solving techniques and a heuristic based on feature-level sensitivity ranking. We evaluate our method on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.
LGMar 16
Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMsIdo Pinto, Yizhak Yisrael Elboher, Haoze Wu et al.
The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset result in consistent and significant performance gain. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models; and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
AIJul 9, 2024
Safe and Reliable Training of Learning-Based Aerospace ControllersUdayan Mandal, Guy Amir, Haoze Wu et al.
In recent years, deep reinforcement learning (DRL) approaches have generated highly successful controllers for a myriad of complex domains. However, the opaque nature of these models limits their applicability in aerospace systems and safety-critical domains, in which a single mistake can have dire consequences. In this paper, we present novel advancements in both the training and verification of DRL controllers, which can help ensure their safe behavior. We showcase a design-for-verification approach utilizing k-induction and demonstrate its use in verifying liveness properties. In addition, we also give a brief overview of neural Lyapunov Barrier certificates and summarize their capabilities on a case study. Finally, we describe several other novel reachability-based approaches which, despite failing to provide guarantees of interest, could be effective for verification of other DRL systems, and could be of further interest to the community.
DBOct 30, 2025
SpotIt: Evaluating Text-to-SQL Evaluation with Formal VerificationRocky Klopfenstein, Yang He, Andrew Tremante et al.
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
LGSep 4, 2024
Efficiently Computing Compact Formal ExplanationsMin Wu, Xiaofu Li, Haoze Wu et al.
Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning models, we present VeriX+, which significantly improves both the size and the generation time of formal explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time -- the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of $38\%$ on the GTSRB dataset and a time reduction of $90\%$ on MNIST. We demonstrate that our approach is scalable to transformers and real-world scenarios such as autonomous aircraft taxiing and sentiment analysis. We conclude by showcasing several novel applications of formal explanations.
LONov 11, 2025
Proof Minimization in Neural Network VerificationOmri Isac, Idan Refaeli, Haoze Wu et al.
The widespread adoption of deep neural networks (DNNs) requires efficient techniques for verifying their safety. DNN verifiers are complex tools, which might contain bugs that could compromise their soundness and undermine the reliability of the verification process. This concern can be mitigated using proofs: artifacts that are checkable by an external and reliable proof checker, and which attest to the correctness of the verification process. However, such proofs tend to be extremely large, limiting their use in many scenarios. In this work, we address this problem by minimizing proofs of unsatisfiability produced by DNN verifiers. We present algorithms that remove facts which were learned during the verification process, but which are unnecessary for the proof itself. Conceptually, our method analyzes the dependencies among facts used to deduce UNSAT, and removes facts that did not contribute. We then further minimize the proof by eliminating remaining unnecessary dependencies, using two alternative procedures. We implemented our algorithms on top of a proof producing DNN verifier, and evaluated them across several benchmarks. Our results show that our best-performing algorithm reduces proof size by 37%-82% and proof checking time by 30%-88%, while introducing a runtime overhead of 7%-20% to the verification process itself.
CLJun 25, 2025Code
ReCode: Updating Code API Knowledge with Reinforcement LearningHaoze Wu, Yunzhi Yao, Wenhao Yu et al.
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms that of the 32B parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
SEMay 14
Viverra: Text-to-Code with GuaranteesHaoze Wu, Rocky Klopfenstein, Keith Farkas et al.
A fundamental limitation of Text-to-Code is that no guarantee can be obtained about the correctness of the generated code. Therefore, to ensure its correctness, the generated code still has to be reviewed, tested, and maintained by developers. However, parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools. To address this challenge, we present Viverra, a system that automatically produces formally verified annotations alongside generated code to aid user's understanding of the generated program. Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties. It then verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers. Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions, and that these assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
LGDec 22, 2025
The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and ResultsKonstantin Kaulen, Tobias Ladner, Stanley Bak et al.
This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
LGMar 25
The Luna Bound Propagator for Formal Analysis of Neural NetworksHenry LeCates, Haoze Wu
The parameterized CROWN analysis, a.k.a., alpha-CROWN, has emerged as a practically successful bound propagation method for neural network verification. However, existing implementations of alpha-CROWN are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. We introduce Luna, a new bound propagator implemented in C++. Luna supports Interval Bound Propagation, the CROWN analysis, and the alpha-CROWN analysis over a general computational graph. We describe the architecture of Luna and show that it is competitive with the state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on benchmarks from VNN-COMP 2025.
LGDec 28, 2024
The Fifth International Verification of Neural Networks Competition (VNN-COMP 2024): Summary and ResultsChristopher Brix, Stanley Bak, Taylor T. Johnson et al.
This report summarizes the 5th International Verification of Neural Networks Competition (VNN-COMP 2024), held as a part of the 7th International Symposium on AI Verification (SAIV), that was collocated with the 36th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2024 iteration, 8 teams participated on a diverse set of 12 regular and 8 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
LOMar 12
Incremental Neural Network Verification via Learned ConflictsRaya Elsaleh, Liam Davis, Haoze Wu et al.
Neural network verification is often used as a core component within larger analysis procedures, which generate sequences of closely related verification queries over the same network. In existing neural network verifiers, each query is typically solved independently, and information learned during previous runs is discarded, leading to repeated exploration of the same infeasible regions of the search space. In this work, we aim to expedite verification by reducing this redundancy. We propose an incremental verification technique that reuses learned conflicts across related verification queries. The technique can be added on top of any branch-and-bound-based neural network verifier. During verification, the verifier records conflicts corresponding to learned infeasible combinations of activation phases, and retains them across runs. We formalize a refinement relation between verification queries and show that conflicts learned for a query remain valid under refinement, enabling sound conflict inheritance. Inherited conflicts are handled using a SAT solver to perform consistency checks and propagation, allowing infeasible subproblems to be detected and pruned early during search. We implement the proposed technique in the Marabou verifier and evaluate it on three verification tasks: local robustness radius determination, verification with input splitting, and minimal sufficient feature set extraction. Our experiments show that incremental conflict reuse reduces verification effort and yields speedups of up to $1.9\times$ over a non-incremental baseline.
LGDec 20, 2023
Towards Efficient Verification of Quantized Neural NetworksPei Huang, Haoze Wu, Yuting Yang et al.
Quantization replaces floating point arithmetic with integer arithmetic in deep neural network models, providing more efficient on-device inference with less power and memory. In this work, we propose a framework for formally verifying properties of quantized neural networks. Our baseline technique is based on integer linear programming which guarantees both soundness and completeness. We then show how efficiency can be improved by utilizing gradient-based heuristic search methods and also bound-propagation techniques. We evaluate our approach on perception networks quantized with PyTorch. Our results show that we can verify quantized networks with better scalability and efficiency than the previous state of the art.
AIMay 22, 2024
Formally Verifying Deep Reinforcement Learning Controllers with Lyapunov Barrier CertificatesUdayan Mandal, Guy Amir, Haoze Wu et al.
Deep reinforcement learning (DRL) is a powerful machine learning paradigm for generating agents that control autonomous systems. However, the ``black box'' nature of DRL agents limits their deployment in real-world safety-critical applications. A promising approach for providing strong guarantees on an agent's behavior is to use Neural Lyapunov Barrier (NLB) certificates, which are learned functions over the system whose properties indirectly imply that an agent behaves as desired. However, NLB-based certificates are typically difficult to learn and even more difficult to verify, especially for complex systems. In this work, we present a novel method for training and verifying NLB-based certificates for discrete-time systems. Specifically, we introduce a technique for certificate composition, which simplifies the verification of highly-complex systems by strategically designing a sequence of certificates. When jointly verified with neural network verification engines, these certificates provide a formal guarantee that a DRL agent both achieves its goals and avoids unsafe behavior. Furthermore, we introduce a technique for certificate filtering, which significantly simplifies the process of producing formally verified certificates. We demonstrate the merits of our approach with a case study on providing safety and liveness guarantees for a DRL-controlled spacecraft.
PLJan 10, 2025
Neural Network Verification is a Programming Language ChallengeLucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin et al.
Neural network verification is a new and rapidly developing field of research. So far, the main priority has been establishing efficient verification algorithms and tools, while proper support from the programming language perspective has been considered secondary or unimportant. Yet, there is mounting evidence that insights from the programming language community may make a difference in the future development of this domain. In this paper, we formulate neural network verification challenges as programming language challenges and suggest possible future solutions.
LGFeb 5
Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier CertificatesChengxiao Wang, Haoze Wu, Gagandeep Singh
Neural Lyapunov and barrier certificates have recently been used as powerful tools for verifying the safety and stability properties of deep reinforcement learning (RL) controllers. However, existing methods offer guarantees only under fixed ideal unperturbed dynamics, limiting their reliability in real-world applications where dynamics may deviate due to uncertainties. In this work, we study the problem of synthesizing \emph{robust neural Lyapunov barrier certificates} that maintain their guarantees under perturbations in system dynamics. We formally define a robust Lyapunov barrier function and specify sufficient conditions based on Lipschitz continuity that ensure robustness against bounded perturbations. We propose practical training objectives that enforce these conditions via adversarial training, Lipschitz neighborhood bound, and global Lipschitz regularization. We validate our approach in two practically relevant environments, Inverted Pendulum and 2D Docking. The former is a widely studied benchmark, while the latter is a safety-critical task in autonomous systems. We show that our methods significantly improve both certified robustness bounds (up to $4.6$ times) and empirical success rates under strong perturbations (up to $2.4$ times) compared to the baseline. Our results demonstrate effectiveness of training robust neural certificates for safe RL under perturbations in dynamics.
LGAug 28, 2025
Mirage or Method? How Model-Task Alignment Induces Divergent RL ConclusionsHaoze Wu, Cheng Wang, Wenshuo Zhao et al.
Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
IRDec 14, 2024
USM: Unbiased Survey Modeling for Limiting Negative User Experiences in Recommendation SystemsChenghui Yu, Peiyi Li, Haoze Wu et al.
Reducing negative user experiences is essential for the success of recommendation platforms. Exposing users to inappropriate content could not only adversely affect users' psychological well-beings, but also potentially drive users away from the platform, sabotaging the platform's long-term success. However, recommendation algorithms tend to weigh more heavily on positive feedback signals due to the scarcity of negative ones, which may result in the neglect of valuable negative user feedback. In this paper, we propose an approach aimed at limiting negative user experiences. Our method primarily relies on distributing in-feed surveys to the users, modeling the users' feedback collected from the survey, and integrating the model predictions into the recommendation system. We further enhance the baseline survey model by integrating the Learning Hidden Unit Contributions module and the Squeeze-and-Excitation module. In addition, we strive to resolve the problem of response Bias by applying a survey-submit model; The A/B testing results indicate a reduction in survey sexual rate and survey inappropriate rate, ranging from -1.44\% to -3.9\%. Additionally, we compared our methods against an online baseline that does not incorporate our approach. The results indicate that our approach significantly reduces the report rate and dislike rate by 1\% to 2.27\% compared to the baseline, confirming the effectiveness of our methods in enhancing user experience. After we launched the survey model based our approach on our platform, the model is able to bring reductions of 1.75\%, 2.57\%, 2.06\% on reports, dislikes, survey inappropriate rate, respectively.
LOMar 15, 2025
Proof-Driven Clause Learning in Neural Network VerificationOmri Isac, Idan Refaeli, Haoze Wu et al.
The widespread adoption of deep neural networks (DNNs) requires efficient techniques for safety verification. Existing methods struggle to scale to real-world DNNs, and tremendous efforts are being put into improving their scalability. In this work, we propose an approach for improving the scalability of DNN verifiers using Conflict-Driven Clause Learning (CDCL) -- an approach that has proven highly successful in SAT and SMT solving. We present a novel algorithm for deriving conflict clauses using UNSAT proofs, and propose several optimizations for expediting it. Our approach allows a modular integration of SAT solvers and DNN verifiers, and we implement it on top of an interface designed for this purpose. The evaluation of our implementation over several benchmarks suggests a 2X--3X improvement over a similar approach, with specific cases outperforming the state of the art.
CLOct 29, 2025
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task ExecutionJunlong Li, Wenshuo Zhao, Jian Zhao et al. · cmu
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
PLSep 25, 2025
InvBench: Can LLMs Accelerate Program Verification with Invariant Synthesis?Anjiang Wei, Tarun Suresh, Tianran Sun et al. · stanford
Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We introduce a principled framework for evaluating LLMs on invariant synthesis. Our approach uses a verifier-based decision procedure with a formal soundness guarantee and assesses not only correctness but also the speedup that invariants provide in verification. We evaluate 7 state-of-the-art LLMs, and existing LLM-based verifiers against the traditional solver UAutomizer. While LLM-based verifiers represent a promising direction, they do not yet offer a significant advantage over UAutomizer. Model capability also proves critical, as shown by sharp differences in speedups across models, and our benchmark remains an open challenge for current LLMs. Finally, we show that supervised fine-tuning and Best-of-N sampling can improve performance: fine-tuning on 3589 instances raises the percentage of speedup cases for Qwen3-Coder-480B from 8% to 29.2%, and Best-of-N sampling with N=16 improves Claude-sonnet-4 from 8.8% to 22.1%.
LOJun 11, 2025
Abstraction-Based Proof Production in Formal Verification of Neural NetworksYizhak Yisrael Elboher, Omri Isac, Guy Katz et al.
Modern verification tools for deep neural networks (DNNs) increasingly rely on abstraction to scale to realistic architectures. In parallel, proof production is becoming a critical requirement for increasing the reliability of DNN verification results. However, current proofproducing verifiers do not support abstraction-based reasoning, creating a gap between scalability and provable guarantees. We address this gap by introducing a novel framework for proof-producing abstraction-based DNN verification. Our approach modularly separates the verification task into two components: (i) proving the correctness of an abstract network, and (ii) proving the soundness of the abstraction with respect to the original DNN. The former can be handled by existing proof-producing verifiers, whereas we propose the first method for generating formal proofs for the latter. This preliminary work aims to enable scalable and trustworthy verification by supporting common abstraction techniques within a formal proof framework.
LGJun 18, 2024
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace TheoryHaoze Wu, Zihan Qiu, Zili Wang et al.
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. By dynamically and sparsely selecting activated experts, MoE can effectively reduce computational costs. Despite the success, we observe that many tokens in the MoE models have uncertain routing results. These tokens have nearly equal scores for choosing each expert, and we demonstrate that this uncertainty can lead to incorrect selections. Inspired by the Global Workspace Theory (GWT), we propose a new fine-tuning method, GW-MoE, to address this issue. The core idea is to broadcast the uncertain tokens across experts during fine-tuning. Therefore, these tokens can acquire the necessary knowledge from any expert during inference and become less sensitive to the choice. GW-MoE does not introduce additional inference overhead. We validate that GW can mitigate the uncertain problem and consistently improve in different tasks (text classification, question answering, summarization, code generation, and mathematical problem solving) and model sizes (650M and 8B parameters).
AIJan 25, 2024
Marabou 2.0: A Versatile Formal Analyzer of Neural NetworksHaoze Wu, Omri Isac, Aleksandar Zeljić et al.
This paper serves as a comprehensive system description of version 2.0 of the Marabou framework for formal analysis of neural networks. We discuss the tool's architectural design and highlight the major features and components introduced since its initial release.
AIMay 18, 2023
Lightweight Online Learning for Sets of Related Problems in Automated ReasoningHaoze Wu, Christopher Hahn, Florian Lonsing et al.
We present Self-Driven Strategy Learning ($\textit{sdsl}$), a lightweight online learning methodology for automated reasoning tasks that involve solving a set of related problems. $\textit{sdsl}$ does not require offline training, but instead automatically constructs a dataset while solving earlier problems. It fits a machine learning model to this data which is then used to adjust the solving strategy for later problems. We formally define the approach as a set of abstract transition rules. We describe a concrete instance of the sdsl calculus which uses conditional sampling for generating data and random forests as the underlying machine learning model. We implement the approach on top of the Kissat solver and show that the combination of Kissat+$\textit{sdsl}$ certifies larger bounds and finds more counter-examples than other state-of-the-art bounded model checking approaches on benchmarks obtained from the latest Hardware Model Checking Competition.
LGMar 2, 2021
DeepCert: Verification of Contextually Relevant Robustness for Neural Network Image ClassifiersColin Paterson, Haoze Wu, John Grese et al.
We introduce DeepCert, a tool-supported method for verifying the robustness of deep neural network (DNN) image classifiers to contextually relevant perturbations such as blur, haze, and changes in image contrast. While the robustness of DNN classifiers has been the subject of intense research in recent years, the solutions delivered by this research focus on verifying DNN robustness to small perturbations in the images being classified, with perturbation magnitude measured using established Lp norms. This is useful for identifying potential adversarial attacks on DNN image classifiers, but cannot verify DNN robustness to contextually relevant image perturbations, which are typically not small when expressed with Lp norms. DeepCert addresses this underexplored verification problem by supporting:(1) the encoding of real-world image perturbations; (2) the systematic evaluation of contextually relevant DNN robustness, using both testing and formal verification; (3) the generation of contextually relevant counterexamples; and, through these, (4) the selection of DNN image classifiers suitable for the operational context (i)envisaged when a potentially safety-critical system is designed, or (ii)observed by a deployed system. We demonstrate the effectiveness of DeepCert by showing how it can be used to verify the robustness of DNN image classifiers build for two benchmark datasets (`German Traffic Sign' and `CIFAR-10') to multiple contextually relevant perturbations.
LGNov 5, 2020
An SMT-Based Approach for Verifying Binarized Neural NetworksGuy Amir, Haoze Wu, Clark Barrett et al.
Deep learning has emerged as an effective approach for creating modern software systems, with neural networks often surpassing hand-crafted systems. Unfortunately, neural networks are known to suffer from various safety and security issues. Formal verification is a promising avenue for tackling this difficulty, by formally certifying that networks are correct. We propose an SMT-based technique for verifying Binarized Neural Networks - a popular kind of neural network, where some weights have been binarized in order to render the neural network more memory and energy efficient, and quicker to evaluate. One novelty of our technique is that it allows the verification of neural networks that include both binarized and non-binarized components. Neural network verification is computationally very difficult, and so we propose here various optimizations, integrated into our SMT procedure as deduction steps, as well as an approach for parallelizing verification queries. We implement our technique as an extension to the Marabou framework, and use it to evaluate the approach on popular binarized neural network architectures.
LGOct 7, 2020
Global Optimization of Objective Functions Represented by ReLU NetworksChristopher A. Strong, Haoze Wu, Aleksandar Zeljić et al.
Neural networks can learn complex, non-convex functions, and it is challenging to guarantee their correct behavior in safety-critical contexts. Many approaches exist to find failures in networks (e.g., adversarial examples), but these cannot guarantee the absence of failures. Verification algorithms address this need and provide formal guarantees about a neural network by answering "yes or no" questions. For example, they can answer whether a violation exists within certain bounds. However, individual "yes or no" questions cannot answer qualitative questions such as "what is the largest error within these bounds"; the answers to these lie in the domain of optimization. Therefore, we propose strategies to extend existing verifiers to perform optimization and find: (i) the most extreme failure in a given input region and (ii) the minimum input perturbation required to cause a failure. A naive approach using a bisection search with an off-the-shelf verifier results in many expensive and overlapping calls to the verifier. Instead, we propose an approach that tightly integrates the optimization process into the verification procedure, achieving better runtime performance than the naive approach. We evaluate our approach implemented as an extension of Marabou, a state-of-the-art neural network verifier, and compare its performance with the bisection approach and MIPVerify, an optimization-based verifier. We observe complementary performance between our extension of Marabou and MIPVerify.
LOApr 17, 2020
Parallelization Techniques for Verifying Neural NetworksHaoze Wu, Alex Ozdemir, Aleksandar Zeljić et al.
Inspired by recent successes with parallel optimization techniques for solving Boolean satisfiability, we investigate a set of strategies and heuristics that aim to leverage parallel computing to improve the scalability of neural network verification. We introduce an algorithm based on partitioning the verification problem in an iterative manner and explore two partitioning strategies, that work by partitioning the input space or by case splitting on the phases of the neuron activations, respectively. We also introduce a highly parallelizable pre-processing algorithm that uses the neuron activation phases to simplify the neural network verification problems. An extensive experimental evaluation shows the benefit of these techniques on both existing benchmarks and new benchmarks from the aviation domain. A preliminary experiment with ultra-scaling our algorithm using a large distributed cloud-based platform also shows promising results.
LGOct 29, 2019
G2SAT: Learning to Generate SAT FormulasJiaxuan You, Haoze Wu, Clark Barrett et al.
The Boolean Satisfiability (SAT) problem is the canonical NP-complete problem and is fundamental to computer science, with a wide array of applications in planning, verification, and theorem proving. Developing and evaluating practical SAT solvers relies on extensive empirical testing on a set of real-world benchmark formulas. However, the availability of such real-world SAT formulas is limited. While these benchmark formulas can be augmented with synthetically generated ones, existing approaches for doing so are heavily hand-crafted and fail to simultaneously capture a wide range of characteristics exhibited by real-world SAT instances. In this work, we present G2SAT, the first deep generative framework that learns to generate SAT formulas from a given set of input formulas. Our key insight is that SAT formulas can be transformed into latent bipartite graph representations which we model using a specialized deep generative neural network. We show that G2SAT can generate SAT formulas that closely resemble given real-world SAT instances, as measured by both graph metrics and SAT solver behavior. Further, we show that our synthetic SAT formulas could be used to improve SAT solver performance on real-world benchmarks, which opens up new opportunities for the continued development of SAT solvers and a deeper understanding of their performance.
AIOct 30, 2017
Improve SAT-solving with Machine LearningHaoze Wu
In this project, we aimed to improve the runtime of Minisat, a Conflict-Driven Clause Learning (CDCL) solver that solves the Propositional Boolean Satisfiability (SAT) problem. We first used a logistic regression model to predict the satisfiability of propositional boolean formulae after fixing the values of a certain fraction of the variables in each formula. We then applied the logistic model and added a preprocessing period to Minisat to determine the preferable initial value (either true or false) of each boolean variable using a Monte-Carlo approach. Concretely, for each Monte-Carlo trial, we fixed the values of a certain ratio of randomly selected variables, and calculated the confidence that the resulting sub-formula is satisfiable with our logistic regression model. The initial value of each variable was set based on the mean confidence scores of the trials that started from the literals of that variable. We were particularly interested in setting the initial values of the backbone variables correctly, which are variables that have the same value in all solutions of a SAT formula. Our Monte-Carlo method was able to set 78% of the backbones correctly. Excluding the preprocessing time, compared with the default setting of Minisat, the runtime of Minisat for satisfiable formulae decreased by 23%. However, our method did not outperform vanilla Minisat in runtime, as the decrease in the conflicts was outweighed by the long runtime of the preprocessing period.
LGOct 30, 2017
Gene Ontology (GO) Prediction using Machine Learning MethodsHaoze Wu, Yangyu Zhou
We applied machine learning to predict whether a gene is involved in axon regeneration. We extracted 31 features from different databases and trained five machine learning models. Our optimal model, a Random Forest Classifier with 50 submodels, yielded a test score of 85.71%, which is 4.1% higher than the baseline score. We concluded that our models have some predictive capability. Similar methodology and features could be applied to predict other Gene Ontology (GO) terms.