3.5PLJun 1
Type-Error Ablation and AI Coding AgentsShriram Krishnamurthi, Matthew Flatt
Programming language implementors have designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly -- they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them. AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans? We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs containing a single deliberate type error each, and measure how often an AI agent repairs them under ablation: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct. We find concrete evidence that more detailed error messages improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than only test suite failure reports. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time -- lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.
CLDec 4, 2024
Grounded Language Design for Lightweight Diagramming for Formal MethodsSiddhartha Prasad, Ben Greenman, Tim Nelson et al.
Model finding, as embodied by SAT solvers and similar tools, is used widely, both in embedding settings and as a tool in its own right. For instance, tools like Alloy target SAT to enable users to incrementally define, explore, verify, and diagnose sophisticated specifications for a large number of complex systems. These tools critically include a visualizer that lets users graphically explore these generated models. As we show, however, default visualizers, which know nothing about the domain, are unhelpful and even actively violate presentational and cognitive principles. At the other extreme, full-blown visualizations require significant effort as well as knowledge a specifier might not possess; they can also exhibit bad failure modes (including silent failure). Instead, we need a language to capture essential domain information for lightweight diagramming. We ground our language design in both the cognitive science literature on diagrams and on a large number of example custom visualizations. This identifies the key elements of lightweight diagrams. We distill these into a small set of orthogonal primitives. We extend an Alloy-like tool to support these primitives. We evaluate the effectiveness of the produced diagrams, finding them good for reasoning. We then compare this against many other drawing languages and tools to show that this work defines a new niche that is lightweight, effective, and driven by sound principles.
CRAug 5, 2021
Computing and Authentication Practices in Global Oil and Gas FieldsMary Rose Martinez, Shriram Krishnamurthi
Oil and gas fields are a critical part of our infrastructure, and vulnerable to attack by powerful adversaries. In addition, these are often difficult work environments, with constraints on space, clothing, and more. Yet there is little research on the technology practices and constraints of workers in these environments. We present what we believe is the first survey of oil- and gas-field workers located around the world. We establish the presence and status of a variety of computing devices and of the security practices that govern their use. We also determine the working conditions (such as personal protective equipment) under which these devices are used, which impacts usable security aspects like feasible forms of authentication. We extend these basic insights with additional information from a small number of in-depth interviews. Our preliminary work suggests many directions for improving security in this critical sector.
SEOct 30, 2020
Using Relational Problems to Teach Property-Based TestingJohn Wrenn, Tim Nelson, Shriram Krishnamurthi
Context: The success of QuickCheck has led to the development of property-based testing (PBT) libraries for many languages and the process is getting increasing attention. However, unlike regular testing, PBT is not widespread in collegiate curricula. Furthermore, the value of PBT is not limited to software testing. The growing use of formal methods in, and the growth of software synthesis, all create demand for techniques to train students and developers in the art of specification writing. We posit that PBT forms a strong bridge between testing and the act of specification: it's a form of testing where the tester is actually writing abstract specifications. Inquiry: Even well-informed technologists mention the difficulty of finding good motivating examples for its use. We take steps to fill this lacuna. Approach & Knowledge: We find that the use of "relational" problems -- those for which an input may admit multiple valid outputs -- easily motivates the use of PBT. We also notice that such problems are readily available in the computer science pantheon of problems (e.g., many graph and sorting algorithms). We have been using these for some years now to teach PBT in collegiate courses. Grounding: In this paper, we describe the problems we use and report on students? completion of them. We believe the problems overcome some of the motivation issues described above. We also show that students can do quite well at PBT for these problems, suggesting that the topic is well within their reach. In the process, we introduce a simple method to evaluate the accuracy of their specifications, and use it to characterize their common mistakes. Importance: Based on our findings, we believe that relational problems are an underutilized motivating example for PBT. We hope this paper initiates a catalog of such problems for educators (and developers) to use, and also provides a concrete (though by no means exclusive) method to analyze the quality of PBT.