Stephen N. Freund

PL
h-index: 37
4 papers
17 citations
Novelty: 54%
AI Score: 39

4 Papers

72.9 · PL · May 2
FlowBook: Enforcing Reproducibility in Computational Notebooks

Stephen N. Freund, Emery D. Berger, Cormac Flanagan et al.

Computational notebooks are notoriously prone to reproducibility failures. By permitting out-of-order cell execution, notebooks accumulate hidden state and implicit dependencies that cause interactive executions to silently diverge from clean top-to-bottom runs. Prior approaches either employ dependency analyses or enforce reactive dataflow models that face fundamental tradeoffs among expressiveness, precision, and performance. This paper exploits the insight that reproducibility can be enforced without precise dependency tracking: a notebook is reproducible if and only if executing its cells in top-to-bottom order from an empty store produces exactly the outputs currently recorded. We formalize this notion of reproducibility and present FlowBook, which implements a dynamic analysis that enforces reproducibility by tracking read and write sets at cell boundaries. FlowBook detects stale cells whose recorded outputs may no longer reflect the current notebook state and prevents operations that would violate reproducibility. FlowBook incurs near-imperceptible latency overhead (median: 70 ms).
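
The reproducibility criterion stated in this abstract can be illustrated with a minimal Python sketch. It is the naive check (re-run every cell from an empty store and compare outputs), not FlowBook's dynamic analysis, which the abstract says tracks read and write sets at cell boundaries. The names `run_top_to_bottom` and `is_reproducible` are hypothetical, cells are modeled as plain source strings, and "output" is assumed to mean captured standard output only.

```python
import contextlib
import io


def run_top_to_bottom(cells):
    """Execute cells in order from an empty store; return each cell's printed output."""
    store = {}  # fresh namespace standing in for the "empty store"
    outputs = []
    for source in cells:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, store)
        outputs.append(buffer.getvalue())
    return outputs


def is_reproducible(cells, recorded_outputs):
    """True iff a clean top-to-bottom run yields exactly the recorded outputs."""
    return run_top_to_bottom(cells) == recorded_outputs


# A notebook whose recorded outputs match a clean run.
cells = ["x = 1", "print(x + 1)"]
recorded = ["", "2\n"]

# The first cell is edited, but the second cell's recorded output is left stale.
edited_cells = ["x = 5", "print(x + 1)"]

fresh_ok = is_reproducible(cells, recorded)
stale_ok = is_reproducible(edited_cells, recorded)
```

Here `fresh_ok` is `True` and `stale_ok` is `False`: the second cell's recorded output no longer reflects the edited first cell. Re-executing the whole notebook on every check is expensive, and avoiding that cost is what the abstract's dynamic analysis is for.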

SE · Mar 25, 2024
ChatDBG: Augmenting Debugging with Large Language Models

Kyla H. Levin, Nicolas van Kempen, Emery D. Berger et al.

Debugging is a critical but challenging task for programmers. This paper proposes ChatDBG, an AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to "take the wheel": it can act as an independent agent capable of querying and controlling the debugger to navigate through stacks and inspect program state. It then reports its findings and yields back control to the programmer. By leveraging the real-world knowledge embedded in LLMs, ChatDBG can diagnose issues identifiable only through the use of domain-specific reasoning. Our ChatDBG prototype integrates with standard debuggers including LLDB and GDB for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded more than 75,000 times.

PL · Jan 3, 2025
Effective LLM-Driven Code Generation with Pythoness

Kyla H. Levin, Kyle Gwilt, Emery D. Berger et al.

The advent of large language models (LLMs) has paved the way for a new era of programming tools with both significant capabilities and risks, as the generated code lacks guarantees of correctness and reliability. Developers using LLMs currently face the difficult task of optimizing, integrating, and maintaining code generated by AI. We propose an embedded domain-specific language (DSL), Pythoness, to address those challenges. In Pythoness, developers program with LLMs at a higher level of abstraction. Rather than interacting directly with generated code, developers using Pythoness operate at the level of behavioral specifications when writing functions, classes, or an entire program. These specifications can take the form of unit tests and property-based tests, which may be expressed formally or in natural language. Guided by these specifications, Pythoness generates code that both passes the tests and can be continuously checked during execution. We posit that the Pythoness approach lets developers harness the full potential of LLMs for code generation while substantially mitigating their inherent risks. We describe our current prototype implementation of Pythoness and demonstrate that it can successfully leverage a combination of tests and code generation to yield higher quality code than specifications alone.
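
The idea of code that both passes its tests and stays checked during execution can be sketched in plain Python. This is not Pythoness's API: `first_passing` and `checked` are hypothetical helpers, and a hard-coded list of candidate functions stands in for LLM-generated code.

```python
def first_passing(candidates, tests):
    """Return the first candidate for which every test predicate holds."""
    for candidate in candidates:
        try:
            if all(test(candidate) for test in tests):
                return candidate
        except Exception:
            continue  # a crashing candidate counts as a failure
    raise ValueError("no candidate satisfied the specification")


def checked(func, postcondition):
    """Wrap func so the postcondition is re-checked on every call."""
    def wrapper(*args):
        result = func(*args)
        if not postcondition(args, result):
            raise AssertionError(f"specification violated for {args!r}")
        return result
    return wrapper


# Two stand-in "generated" implementations of absolute value; the first is wrong.
candidates = [lambda n: n, lambda n: n if n >= 0 else -n]

# Unit tests expressed as predicates over a candidate implementation.
tests = [lambda f: f(3) == 3, lambda f: f(-4) == 4]

# Property kept alive at run time: the result is non-negative with the same magnitude.
abs_checked = checked(
    first_passing(candidates, tests),
    lambda args, result: result >= 0 and result in (args[0], -args[0]),
)
```

Calling `abs_checked(-7)` returns `7`. If the selected implementation ever violated the property on some input, the wrapper would raise instead of returning a wrong value silently.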

HC · Apr 18, 2025
Flowco: Rethinking Data Analysis in the Age of LLMs

Stephen N. Freund, Brooke Simon, Emery D. Berger et al.

Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize data science by enabling those with limited programming expertise to conduct data analyses, including in scientific research, business, and policymaking. However, analysts in many real-world settings must often exercise fine-grained control over specific analysis steps, verify intermediate results explicitly, and iteratively refine their analytical approaches. Such tasks present barriers to building robust and reproducible analyses using LLMs alone or even in conjunction with existing authoring tools (e.g., computational notebooks). This paper introduces Flowco, a new mixed-initiative system to address these challenges. Flowco leverages a visual dataflow programming model and integrates LLMs into every phase of the authoring process. A user study suggests that Flowco supports analysts, particularly those with less programming experience, in quickly authoring, debugging, and refining data analyses.