Aarti Gupta

h-index58

5papers

12citations

Novelty53%

AI Score47

Ranked #59,175 of 201,326 authors (top 29%)#647 in SE (top 19%)

5 Papers

SEFeb 10Code

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Haoyu Zhao, Ziran Yang, Jiawei Li et al.

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

97.2SEMar 18

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Zenan Li, Ziran Yang, Deyuan et al.

Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean~4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0\% prove success rate, a 2.6$\times$ improvement over the strongest baseline, surpassing neural provers up to 84$\times$ larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.

LGNov 30, 2023

Combined Scheduling, Memory Allocation and Tensor Replacement for Minimizing Off-Chip Data Accesses of DNN Accelerators

Yi Li, Aarti Gupta, Sharad Malik

Specialized hardware accelerators have been extensively used for Deep Neural Networks (DNNs) to provide power/performance benefits. These accelerators contain specialized hardware that supports DNN operators, and scratchpad memory for storing the tensor operands. Often, the size of the scratchpad is insufficient to store all the tensors needed for the computation, and additional data accesses are needed to move tensors back and forth from host memory during the computation with significant power/performance overhead. The volume of these additional data accesses depends on the operator schedule, and memory allocation (specific locations selected for the tensors in the scratchpad). We propose an optimization framework, named COSMA, for mapping DNNs to an accelerator that finds the optimal operator schedule, memory allocation and tensor replacement that minimizes the additional data accesses. COSMA provides an Integer Linear Programming (ILP) formulation to generate the optimal solution for mapping a DNN to the accelerator for a given scratchpad size. We demonstrate that, using an off-the-shelf ILP solver, COSMA obtains the optimal solution in seconds for a wide-range of state-of-the-art DNNs for different applications. Further, it out-performs existing methods by reducing on average 84% of the non-compulsory data accesses. We further propose a divide-and-conquer heuristic to scale up to certain complex DNNs generated by Neural Architecture Search, and this heuristic solution reduces on average 85% data accesses compared with other works.

35.4NIApr 4

CB-VER: A Stable Foundation for Modular Control Plane Verification

Dexin Zhang, Timothy Alberdingk Thijm, David Walker et al.

Network operators are often interested in verifying \emph{eventually-stable properties} of network control planes: properties of control plane states that hold eventually, and hold forever thereafter, provided the operating environment remains unchanged. Examples include eventually-stable reachability, access control, or path length properties. In this work, we introduce \textsc{CB-Ver}, a new framework for verifying such properties, based on the key idea of a \emph{converges-before graph} (CB-graph for short). When a user provides interfaces for each network component, \textsc{CB-Ver} checks the necessary component-by-component requirements in parallel using an SMT solver. In addition, the tool automatically synthesizes a CB-graph and checks whether it connects all nodes in a network -- if it does, the interfaces are valid and users can check whether additional eventually-stable properties are implied. Moreover, the CB-graph can then be used to determine fault tolerance properties of the network. We formalize our verification algorithm in the Lean theorem proving environment and prove its soundness. We evaluate the performance of \textsc{CB-Ver} on a range of benchmarks that demonstrate its ability to verify expressive properties in reasonable time. Finally, we demonstrate it is possible to automatically generate suitable interfaces by turning the problem around: Given a CB-graph, we use an off-the-shelf Constrained Horn Clause (CHC) solver to synthesize interfaces for every network component that together ensure the given correctness property.

SEJul 15, 2014

RTL2RTL Formal Equivalence: Boosting the Design Confidence

M V Achutha Kiran Kumar, Aarti Gupta, S S Bindumadhava

Increasing design complexity driven by feature and performance requirements and the Time to Market (TTM) constraints force a faster design and validation closure. This in turn enforces novel ways of identifying and debugging behavioral inconsistencies early in the design cycle. Addition of incremental features and timing fixes may alter the legacy design behavior and would inadvertently result in undesirable bugs. The most common method of verifying the correctness of the changed design is to run a dynamic regression test suite before and after the intended changes and compare the results, a method which is not exhaustive. Modern Formal Verification (FV) techniques involving new methods of proving Sequential Hardware Equivalence enabled a new set of solutions for the given problem, with complete coverage guarantee. Formal Equivalence can be applied for proving functional integrity after design changes resulting from a wide variety of reasons, ranging from simple pipeline optimizations to complex logic redistributions. We present here our experience of successfully applying the RTL to RTL (RTL2RTL) Formal Verification across a wide spectrum of problems on a Graphics design. The RTL2RTL FV enabled checking the design sanity in a very short time, thus enabling faster and safer design churn. The techniques presented in this paper are applicable to any complex hardware design.