CLAug 12, 2024
Improving Structural Diversity of Blackbox LLMs via Chain-of-Specification PromptingHalley Young, Yimeng Zeng, Jacob Gardner et al.
The capability to generate diverse text is a key challenge facing large language models (LLMs). Thus far, diversity has been studied via metrics such as $n$-gram diversity or diversity of BERT embeddings. However, for these kinds of diversity, the user has little control over the dimensions along which diversity is considered. For example, in the poetry domain, one might desire diversity in terms of rhyme and meter, whereas in the code domain, one might desire diversity in terms of the kinds of expressions used to solve a problem. We propose a diversity metric called structural diversity, where the user provides a mapping from generated text to features capturing the kinds of diversity that they care about. In addition, we propose a novel strategy called chain-of-specification (CoS) prompting for improving diversity by first having the LLM generate a specification encoding one instance of structural features, and then prompting the LLM to generate text that satisfies these features; notably, our strategy works with blackbox LLMs. In our experiments, we show that for structural diversity in the poetry and code domains, CoS significantly improves diversity compared to several baselines.
PLMar 27
Sheaf-Cohomological Program Analysis: Unifying Bug Finding, Equivalence, and Verification via Äech CohomologyHalley Young
We present a framework in which program analysis -- type checking, bug finding, and equivalence verification -- is organized as computing the Äech cohomology of a semantic presheaf over a program's site category. The presheaf assigns refinement-type information to observation sites and restricts it along data-flow morphisms. The cohomology group $H^{0}$ is the space of globally consistent typings. The first cohomology group $H^{1}$ classifies gluing obstructions -- bugs, type errors, and equivalence failures -- each localized to a specific pair of disagreeing sites. This formulation yields three concrete results unavailable in prior work: (1) the rank of $H^1$ over $F_{2}$ counts the minimum independent fixes; (2) $H_{1}(U, Iso) = 0$ is sound and complete for behavioral equivalence; (3) Mayer-Vietoris enables compositional, incremental obstruction counting. We implement the framework in Deppy, a Python analysis tool, and evaluate it on a suite of 375~benchmarks: 133~bug-detection programs, 134~equivalence pairs, and 108~specification-satisfaction checks. Deppy achieves {100% bug-detection recall} (69% precision, F1 = 81%), 99% equivalence accuracy with zero false equivalences, and 98% spec accuracy with zero false satisfactions -- outperforming mypy and pyright, which report zero findings on unannotated code. The analysis models Python semantics as algebraic geometry: variables live on the generic fiber (non-None) unless on the closed nullable subscheme, integers form Spec($\mathbb{Z}$) with no bounded section (no overflow), and short-circuit evaluation defines an open-set topology on the presheaf.
SEApr 29
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification EvolvesHalley Young, Nikolaj Björner
Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches (F1 = 0.768) on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory.
LGJan 24, 2019
Learning Neurosymbolic Generative Models via Program SynthesisHalley Young, Osbert Bastani, Mayur Naik
Significant strides have been made toward designing better generative models in recent years. Despite this progress, however, state-of-the-art approaches are still largely unable to capture complex global structure in data. For example, images of buildings typically contain spatial patterns such as windows repeating at regular intervals; state-of-the-art generative methods can't easily reproduce these structures. We propose to address this problem by incorporating programs representing global structure into the generative model---e.g., a 2D for-loop may represent a configuration of windows. Furthermore, we propose a framework for learning these models by leveraging program synthesis to generate training data. On both synthetic and real-world data, we demonstrate that our approach is substantially better than the state-of-the-art at both generating and completing images that contain global structure.