SEDec 11, 2023Code
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing InstructionsFederico Cassano, Luisa Li, Akul Sethi et al.
A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
41.9CRMay 3
Automated Channel Fault Analysis with TofuJacob Ginesin, Max von Hippel, Cristina Nita-Rotaru
Distributed protocols are the linchpin of the modern internet, underpinning every internet service. This has in turn motivated a massive body of research ensuring the security, reliability, and performance of distributed protocols. In these works, a wide-ranging assumption is that distributed protocols operate over faulty or attacker-controlled channels, where messages can be arbitrarily inserted, dropped, replayed, or reordered. Formal verification work targeting distributed protocols typically defines its own notion of faulty or malicious channels, then constructively proves their protocol is correct with respect to it. In this work we take a fundamentally different approach: we develop a rigorous methodology for automatically conducting channel fault analysis on distributed protocols, and we introduce Tofu, a generalizable tool that implements our methodology. Tofu provides sound, complete analysis, synthesizing channel fault-based attack traces on arbitrary linear temporal logic (LTL) protocol specifications or proving the absence of such through an exhaustive state-space search. We demonstrate the applicability of Tofu by employing it to study TCP.
CRApr 27, 2025
SAGA: A Security Architecture for Governing AI Agentic SystemsGeorgios Syros, Anshuman Suri, Jacob Ginesin et al.
Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents, mitigating potential damage from malicious agents. Several proposed agentic system designs address agent identity, authorization, and delegation, but remain purely theoretical, without concrete implementation and evaluation. Most importantly, they do not provide user-controlled agent management. To address this gap, we propose SAGA, a scalable Security Architecture for Governing Agentic systems, that offers user oversight over their agents' lifecycle. In our design, users register their agents with a central entity, the Provider, that maintains agent contact information, user-defined access control policies, and helps agents enforce these policies on inter-agent communication. We introduce a cryptographic mechanism for deriving access control tokens, that offers fine-grained control over an agent's interaction with other agents, providing formal security guarantees. We evaluate SAGA on several agentic tasks, using agents in different geolocations, and multiple on-device and cloud LLMs, demonstrating minimal performance overhead with no impact on underlying task utility in a wide range of conditions. Our architecture enables secure and trustworthy deployment of autonomous agents, accelerating the responsible adoption of this technology in sensitive environments.
LGOct 24, 2025
On Uncertainty Calibration for Equivariant FunctionsEdward Berman, Jacob Ginesin, Marco Pacini et al.
Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
LGOct 14, 2024
The State of Julia for Scientific Machine LearningEdward Berman, Jacob Ginesin
Julia has been heralded as a potential successor to Python for scientific machine learning and numerical computing, boasting ergonomic and performance improvements. Since Julia's inception in 2012 and declaration of language goals in 2017, its ecosystem and language-level features have grown tremendously. In this paper, we take a modern look at Julia's features and ecosystem, assess the current state of the language, and discuss its viability and pitfalls as a replacement for Python as the de-facto scientific machine learning language. We call for the community to address Julia's language-level issues that are preventing further adoption.