SEApr 16Code
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing UtilityYining Hong, Yining She, Eunsuk Kang et al. · cmu
AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on $τ^2$-Bench, CAR-bench, and MedAgentBench. We find that 85\% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74\% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all codes and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
SEJan 29, 2022Code
Using Dynamic Binary Instrumentation to Detect Failures in Robotics SoftwareDeborah S. Katz, Christopher S. Timperley, Claire Le Goues
Autonomous and Robotics Systems (ARSs) are widespread, complex, and increasingly coming into contact with the public. Many of these systems are safety-critical, and it is vital to detect software errors to protect against harm. We propose a family of novel techniques to detect unusual program executions and incorrect program behavior. We model execution behavior by collecting low-level signals at run time and using those signals to build machine learning models. These models can identify previously-unseen executions that are more likely to exhibit errors. We describe a tractable approach for collecting dynamic binary runtime signals on ARSs, allowing the systems to absorb most of the overhead from dynamic instrumentation. The architecture of ARSs is particularly well-adapted to hiding the overhead from instrumentation. We demonstrate the efficiency of these approaches on ARDUPILOT -- a popular open-source autopilot software system -- and HUSKY -- an unmanned ground vehicle -- in simulation. We instrument executions to gather data from which we build supervised machine learning models of executions and evaluate the accuracy of these models. We also analyze the amount of training data needed to develop models with various degrees of accuracy, measure the overhead added to executions that use the analysis tool, and analyze which runtime signals are most useful for detecting unusual behavior on the program under test. In addition, we analyze the effects of timing delays on the functional behavior of ARSs.
SEFeb 11, 2025
From Hazard Identification to Controller Design: Proactive and LLM-Supported Safety Engineering for ML-Powered SystemsYining Hong, Christopher S. Timperley, Christian Kästner
Machine learning (ML) components are increasingly integrated into software products, yet their complexity and inherent uncertainty often lead to unintended and hazardous consequences, both for individuals and society at large. Despite these risks, practitioners seldom adopt proactive approaches to anticipate and mitigate hazards before they occur. Traditional safety engineering approaches, such as Failure Mode and Effects Analysis (FMEA) and System Theoretic Process Analysis (STPA), offer systematic frameworks for early risk identification but are rarely adopted. This position paper advocates for integrating hazard analysis into the development of any ML-powered software product and calls for greater support to make this process accessible to developers. By using large language models (LLMs) to partially automate a modified STPA process with human oversight at critical steps, we expect to address two key challenges: the heavy dependency on highly experienced safety engineering experts, and the time-consuming, labor-intensive nature of traditional hazard analysis, which often impedes its integration into real-world development workflows. We illustrate our approach with a running example, demonstrating that many seemingly unanticipated issues can, in fact, be anticipated.
SEFeb 1, 2022
Industry Experiences with Large-Scale RefactoringJames Ivers, Robert L. Nord, Ipek Ozkaya et al.
Software refactoring plays an important role in software engineering. Developers often turn to refactoring when they want to restructure software to improve its quality without changing its external behavior. Studies show that small-scale (floss) refactoring is common in industry and can often be performed by a single developer in short sessions, even though developers do much of this work manually instead of using refactoring tools. However, some refactoring efforts are much larger in scale, requiring entire teams and months of effort, and the role of tools in these efforts is not as well studied. In this paper, we report on a survey we conducted with developers to understand large-scale refactoring, its prevalence, and how tools support it. Our results from 107 industry developers demonstrate that projects commonly go through multiple large-scale refactorings, each of which requires considerable effort. While there is often a desire to refactor, other business concerns such as developing new features often take higher priority. Our study finds that developers use several categories of tools to support large-scale refactoring and rely more heavily on general-purpose tools like IDEs than on tools designed specifically to support refactoring. Tool support varies across the different activities, with some particularly challenging activities seeing little use of tools in practice. Our study demonstrates a clear need for better large-scale refactoring tools and an opportunity for refactoring researchers to make a difference in industry. The results we summarize in this paper is one concrete step towards this goal.
PLOct 11, 2021
User-driven Design and Evaluation of Liquid Types in JavaCatarina Gamboa, Paulo Alexandre Santos, Christopher S. Timperley et al.
Bugs that are detected earlier during the development lifecycle are easier and cheaper to fix, whereas bugs that are found during production are difficult and expensive to address, and may have dire consequences. Type systems are particularly effective at identifying and preventing bugs early in the development lifecycle by causing invalid programs to result in build failure. Liquid Types are more powerful than those found in mainstream programming languages, allowing the detection of more classes of bugs. However, while Liquid Types were proposed in 2008 with their integration in ML and subsequently introduced in C (2012), Javascript(2012) and Haskell(2014) through language extensions, they have yet to become widely adopted by mainstream developers. This paper investigates how Liquid Types can be integrated in a mainstream programming language, Java, by proposing a new design that aims to lower the barrier to entry and adapts to problems that Java developers commonly encounter at runtime. To promote accessibility, we conducted a series of developer surveys to design the syntax of LiquidJava, our prototype. To evaluate the prototype's usability, we conducted a user study of 30 Java developers, concluding that users intend to use LiquidJava and that it helped to find more bugs and debug faster.
ROApr 17, 2021
GzScenic: Automatic Scene Generation for Gazebo SimulatorAfsoon Afzal, Claire Le Goues, Christopher S. Timperley
Testing robotic and cyberphysical systems in simulation require specifications of the simulated environments (i.e., scenes). The Scenic domain-specific language provides a high-level probabilistic programming language that allows users to specify scenarios for simulation. Scenic automatically generates concrete scenes that can be rendered by simulators. However, Scenic is mainly designed for autonomous vehicle simulation and does not support the most popular general-purpose simulator: Gazebo. In this work, we present GzScenic; a tool that automatically generates scenes for simulation in Gazebo. GzScenic automatically generates both the models required for running Scenic on the scenarios, and the models that Gazebo requires for running the simulation.
SEAug 3, 2020
Understanding and Improving Artifact Sharing in Software Engineering ResearchChristopher S. Timperley, Lauren Herckis, Claire Le Goues et al.
In recent years, many software engineering researchers have begun to include artifacts alongside their research papers. Ideally, artifacts, including tools, benchmarks, and data, support the dissemination of ideas, provide evidence for research claims, and serve as a starting point for future research. However, in practice, artifacts suffer from a variety of issues that prevent the realization of their full potential. To help the software engineering community realize the full potential of artifacts, we seek to understand the challenges involved in the creation, sharing, and use of artifacts. To that end, we perform a mixed-methods study including a survey of artifacts in software engineering publications, and an online survey of 153 software engineering researchers. By analyzing the perspectives of artifact creators, users, and reviewers, we identify several high-level challenges that affect the quality of artifacts including mismatched expectations between these groups, and a lack of sufficient reward for both creators and reviewers. Using Diffusion of Innovations as an analytical framework, we examine how these challenges relate to one another, and build an understanding of the factors that affect the sharing and success of artifacts. Finally, we make recommendations to improve the quality of artifacts based on our results and existing best practices.
ROApr 15, 2020
A Study on the Challenges of Using Robotics Simulators for TestingAfsoon Afzal, Deborah S. Katz, Claire Le Goues et al.
Robotics simulation plays an important role in the design, development, and verification and validation of robotic systems. Recent studies have shown that simulation may be used as a cheaper, safer, and more reliable alternative to manual, and widely used, process of field testing. This is particularly important in the context of continuous integration pipelines, where integrated automated testing is key to reducing costs while maintaining system safety. However, simulation and automated testing are not seeing the degree of widespread adoption in practice that their potential would motivate. Our goal in this paper is to develop a principled understanding of the ways developers use simulation in their process, and the challenges they face in doing so. This type of understanding can guide the development of more effective simulators and testing techniques for modern robotics development. To that end, we conduct a survey of 82 robotics developers from a diversity of backgrounds that addresses the current capabilities and limits of simulation technology in practice. We find that simulation is used by 85% of our participants for testing, and that many participants desire to use simulation as part of their test automation. We identify 10 high-level challenges that impede developers from using simulation for manual and automated testing, and general purposes. These challenges include the gap between simulation and reality, a lack of reproducibility, and considerable resource costs associated with using simulators. Finally, we outline avenues for improvement in the development of new simulators that can help simulation reach its potential as a means of verification and validation.