SEMar 31
WybeCoder: Verified Imperative Code GenerationFabian Gloeckle, Mantas Baksys, Darius Feher et al.
Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development where code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, to equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements by scaling our approach, synthesizing dozens of valid invariants and dispatching of dozens of subgoals, resulting in hundreds of lines of verified code, overcoming plateaus reported in previous works. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path to automated construction of large-scale datasets of verified imperative code.
QMJun 7, 2023
Neural Embeddings for Protein GraphsFrancesco Ceccarelli, Lorenzo Giusti, Sean B. Holden et al.
Proteins perform much of the work in living organisms, and consequently the development of efficient computational methods for protein representation is essential for advancing large-scale biological research. Most current approaches struggle to efficiently integrate the wealth of information contained in the protein sequence and structure. In this paper, we propose a novel framework for embedding protein graphs in geometric vector spaces, by learning an encoder function that preserves the structural distance between protein graphs. Utilizing Graph Neural Networks (GNNs) and Large Language Models (LLMs), the proposed framework generates structure- and sequence-aware protein representations. We demonstrate that our embeddings are successful in the task of comparing protein structures, while providing a significant speed-up compared to traditional approaches based on structural alignment. Our framework achieves remarkable results in the task of protein structure classification; in particular, when compared to other work, the proposed method shows an average F1-Score improvement of 26% on out-of-distribution (OOD) samples and of 32% when tested on samples coming from the same distribution as the training data. Our approach finds applications in areas such as drug prioritization, drug re-purposing, disease sub-type analysis and elsewhere.
LGDec 11, 2025Code
MINIF2F-DAFNY: LLM-Guided Mathematical Theorem Proving via Auto-Active VerificationMantas Baksys, Stefan Zetzsche, Olivier Bouissou et al.
LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed low-level proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction--LLMs for mathematical theorem proving in auto-active verifiers--remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny's automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving 55.7% success with the best model (Claude Sonnet 4.5) using modest resources. These results demonstrate effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .
LGOct 3, 2025Code
ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics DataSantanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden et al.
Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at: \href{https://github.com/santanurathod/ContextFlow}{ContextFlow}
AIFeb 25, 2022
Building a 3-Player Mahjong AI using Deep Reinforcement LearningXiangyu Zhao, Sean B. Holden
Mahjong is a popular multi-player imperfect-information game developed in China in the late 19th-century, with some very challenging features for AI research. Sanma, being a 3-player variant of the Japanese Riichi Mahjong, possesses unique characteristics including fewer tiles and, consequently, a more aggressive playing style. It is thus challenging and of great research interest in its own right, but has not yet been explored. In this paper, we present Meowjong, an AI for Sanma using deep reinforcement learning. We define an informative and compact 2-dimensional data structure for encoding the observable information in a Sanma game. We pre-train 5 convolutional neural networks (CNNs) for Sanma's 5 actions -- discard, Pon, Kan, Kita and Riichi, and enhance the major action's model, namely the discard model, via self-play reinforcement learning using the Monte Carlo policy gradient method. Meowjong's models achieve test accuracies comparable with AIs for 4-player Mahjong through supervised learning, and gain a significant further enhancement from reinforcement learning. Being the first ever AI in Sanma, we claim that Meowjong stands as a state-of-the-art in this game.
CLFeb 4, 2020
Structural Inductive Biases in Emergent CommunicationAgnieszka Słowik, Abhinav Gupta, William L. Hamilton et al.
In order to communicate, humans flatten a complex representation of ideas and their attributes into a single word or a sentence. We investigate the impact of representation learning in artificial agents by developing graph referential games. We empirically show that agents parametrized by graph neural networks develop a more compositional language compared to bag-of-words and sequence models, which allows them to systematically generalize to new combinations of familiar features.
LGJan 24, 2020
Towards Graph Representation Learning in Emergent CommunicationAgnieszka Słowik, Abhinav Gupta, William L. Hamilton et al.
Recent findings in neuroscience suggest that the human brain represents information in a geometric structure (for instance, through conceptual spaces). In order to communicate, we flatten the complex representation of entities and their attributes into a single word or a sentence. In this paper we use graph convolutional networks to support the evolution of language and cooperation in multi-agent systems. Motivated by an image-based referential game, we propose a graph referential game with varying degrees of complexity, and we provide strong baseline models that exhibit desirable properties in terms of language emergence and cooperation. We show that the emerged communication protocol is robust, that the agents uncover the true factors of variation in the game, and that they learn to generalize beyond the samples encountered during training.
AISep 18, 2019
Bayesian Optimisation with Gaussian Processes for Premise SelectionAgnieszka Słowik, Chaitanya Mangla, Mateja Jamnik et al.
Heuristics in theorem provers are often parameterised. Modern theorem provers such as Vampire utilise a wide array of heuristics to control the search space explosion, thereby requiring optimisation of a large set of parameters. An exhaustive search in this multi-dimensional parameter space is intractable in most cases, yet the performance of the provers is highly dependent on the parameter assignment. In this work, we introduce a principled probablistic framework for heuristics optimisation in theorem provers. We present results using a heuristic for premise selection and The Archive of Formal Proofs (AFP) as a case study.