h-index32
2papers

2 Papers

SEFeb 10Code
SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Zhirui Zhang, Hongbo Zhang, Haoxiang Fei et al.

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.

11.0PLMay 29
Practical Algebraic Stepping with Scoped Filters

Haoxiang Fei, Matthew Keenan, Cyrus Omar

Algebraic steppers help students learn functional programming by displaying evaluation as a sequence of small-step reductions, but even simple programs produce long traces in which key ideas are buried under mundane reductions. This paper presents the filtered stepper calculus, a formal framework that gives users scoped, pattern-based control over which reduction steps are shown or hidden. Users annotate programs with lightweight filter expressions that match on the structure of redexes. Filters compose via lexical scoping so that inner filters override outer ones. We prove preservation, progress, and a simulation theorem establishing that the filtered stepper agrees with the underlying unfiltered semantics, and mechanize all proofs in Agda. We implement the calculus in the Hazel live programming environment, including its support for stepping programs with holes and type errors. To do so, we reconcile Hazel's internal environment-based evaluator with the substitution-based presentation expected in the classroom. We deploy the system in a university programming languages course. Our evaluation shows that students adopt the stepper organically, though more advanced uses of filters may require further instruction, and that instructors can use filters to craft focused traces for use in lectures.