3 Papers

22.7PLApr 23
Causality and Semantic Separation

Anna Zhang, Qinglan Luo, London Bielicke et al.

The design of scientific experiments deserves its own variation of formal verification to catch cases where scientists made important mistakes, such as forgetting to take confounding variables into account. One of the most fundamental underpinnings of science is causality, or what it means for interventions in the world to cause other outcomes, as formalized by computer scientists like Judea Pearl. However, these ideas had not previously been made rigorous to the standards of the programming-languages community, where one expects a (syntactic) program analysis to be proved sound with respect to a natural semantics. In the domain of causality, as the relevant "program analysis," we focus on $d$-separation, a classic condition on graphs that can be used to decide when the design of an experiment controls for sufficiently many confounding variables, even though the reason that this condition works is often unintuitive. Our central result (mechanized in Rocq) is that $d$-separation exactly coincides with a novel semantic definition inspired by noninterference from the theory of security. This characterization provides a structural semantic foundation for $d$-separation and helps explain why the graph-theoretic condition is correct, independently of probabilistic assumptions. For each given automated test on the quality of an experiment design, our theorem justifies an associated method for falsifying the world-modeling hypothesis behind the experiment.

23.2HCApr 14
PLanet: Formalizing and Analyzing Assignment Procedures in the Design of Experiments

London Bielicke, Anna Zhang, Shruti Tyagi et al.

Experimental designs reflect assumptions about variable relationships that determine what causal queries researchers can answer through the experiment. Accounting for and communicating these assumptions is essential for drawing valid, generalizable conclusions from scientific experiments. Unfortunately, existing experimental design tools elide these details, expecting researchers to reason about design decisions and assumptions on their own. To surface assumptions and enable design exploration, we introduce a grammar of composable operators for constructing experimental assignment procedures grounded in matrix algebra. The PLanet DSL implements this grammar and compiles PLanet programs into constraint satisfaction problems over matrices. Together, PLanet's composable grammar and matrix representation enable a static analysis to determine which causal queries are testable under different assumptions. In an expressivity evaluation, PLanet was the most expressive of existing DSLs. Critical reflections with the authors of these DSLs revealed that PLanet makes design choices explicit without requiring procedural specification. Think-aloud studies showed that PLanet facilitated design exploration and surfaced assumptions researchers may otherwise overlook.

SEFeb 28, 2022
Automatic Test-Case Reduction in Proof Assistants: A Case Study in Coq

Jason Gross, Théo Zimmermann, Rajashree Agrawal et al.

As the adoption of proof assistants increases, there is a need for efficiency in identifying, documenting, and fixing compatibility issues that arise from proof assistant evolution. We present the Coq Bug Minimizer, a tool for reproducing buggy behavior with minimal and standalone files, integrated with coqbot to trigger automatically on Coq reverse CI failures. Our tool eliminates the overhead of having to download, set up, compile, and then explore and understand large developments: enabling Coq developers to easily obtain modular test-case files for fast experimentation. In this paper, we describe insights about how test-case reduction is different in Coq than in traditional compilers. We expect that our insights will generalize to other proof assistants. We evaluate the Coq Bug Minimizer on over 150 CI failures. Our tool succeeds in reducing failures to smaller test cases in roughly 75% of the time. The minimizer produces a fully standalone test case 89% of the time, and it is on average about one-third the size of the original test. The average reduced test case compiles in 1.25 seconds, with 75% taking under half a second.