SE · AI · Apr 8

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

arXiv:2604.06742 · 73.21 citations · h-index: 4
Predicted impact: top 18% in SE · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the need of AI researchers and practitioners for better benchmarks of intent-driven development, though the advance is incremental because its scope is limited to CLI tools.

The paper tackles the problem of evaluating LLM-based generation of complete software from scratch by introducing CLI-Tool-Bench, a benchmark that assesses 0-to-1 CLI tool creation without predefined scaffolds, revealing that top models achieve under 43% success.

Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations: reliance on predefined scaffolds that ignore repository structure planning, and rigid white-box unit testing that lacks end-to-end behavioral validation. To bridge this gap, we introduce CLI-Tool-Bench, a structure-agnostic benchmark for evaluating the ground-up generation of Command-Line Interface (CLI) tools. It features 100 diverse real-world repositories evaluated via a black-box differential testing framework. Agent-generated software is executed in sandboxes, comparing system side effects and terminal outputs against human-written oracles using multi-tiered equivalence metrics. Evaluating seven state-of-the-art LLMs, we reveal that top models achieve under 43% success, highlighting the ongoing challenge of 0-to-1 generation. Furthermore, higher token consumption does not guarantee better performance, and agents tend to generate monolithic code.
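The black-box differential testing idea in the abstract can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the function names (`run_in_sandbox`, `equivalence_tier`), the use of a temporary directory as the "sandbox", and the three tier labels are all assumptions made for illustration. The sketch runs a command in a fresh directory, records its exit code, standard output, and the files it created, and then grades a candidate run against an oracle run.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(argv):
    """Run argv in a fresh temp directory (a stand-in for a real sandbox).

    Returns (exit code, stdout, {relative file path: SHA-256 of contents}).
    """
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            argv, cwd=workdir, capture_output=True, text=True, timeout=30
        )
        # Snapshot the side effects: every file left behind in the directory.
        snapshot = {
            str(p.relative_to(workdir)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(Path(workdir).rglob("*"))
            if p.is_file()
        }
    return proc.returncode, proc.stdout, snapshot


def equivalence_tier(oracle, candidate):
    """Grade a candidate observation against the oracle (hypothetical tiers)."""
    o_code, o_out, o_files = oracle
    c_code, c_out, c_files = candidate
    if o_code != c_code or o_files != c_files:
        return "mismatch"  # exit code or file system side effects differ
    if o_out == c_out:
        return "exact"  # terminal output is byte-for-byte identical
    if o_out.split() == c_out.split():
        return "whitespace-insensitive"  # same tokens, different spacing
    return "side-effects-only"  # side effects agree, terminal output does not


if __name__ == "__main__":
    # Two toy "CLI tools": same file written, slightly different stdout spacing.
    write_file = "from pathlib import Path; Path('out.txt').write_text('hi'); "
    oracle_cmd = [sys.executable, "-c", write_file + "print('done')"]
    candidate_cmd = [sys.executable, "-c", write_file + "print('done  ')"]
    print(equivalence_tier(run_in_sandbox(oracle_cmd), run_in_sandbox(candidate_cmd)))
```

Because only externally observable behavior is compared, the candidate's internal file layout never enters the check, which is what makes this style of evaluation structure-agnostic. A real harness would need stronger isolation than a temporary directory and a richer notion of side effects.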

