SE · May 8

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

Chenyu Zhao, Shenglin Zhang, Zeshun Huang, Weilin Jin, Yongqian Sun, Dan Pei, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Minghua Ma

arXiv:2511.00780 · 73.03 citations · h-index: 20

Predicted impact: top 22% in SE · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners in software engineering, this is the first architecture-aware benchmark for LLM-based build repair, though the results show current models are far from reliable.

The paper introduces Build-bench, a benchmark for evaluating LLMs' ability to repair build failures during cross-ISA migration. The best model achieves a 63.19% build success rate, revealing current limitations in handling real-world software migration tasks.

Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation across the studied models, Build-bench reveals that current models achieve a maximum build success rate of 63.19% and that tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
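
The iterative repair loop described in the abstract can be illustrated with a minimal Python sketch. This is not the Build-bench implementation: `repair_loop`, `Attempt`, `fake_model`, `fake_build`, and the `max_iterations` parameter are hypothetical names chosen for illustration, and the model and build steps are stand-in stubs. It only shows the control flow the abstract states: on failure, the model receives the updated build log plus previous repair outcomes and tries again.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One repair attempt: the proposed change and what the build reported."""
    patch: str
    succeeded: bool
    log: str


def repair_loop(
    initial_log: str,
    propose_patch: Callable[[str, List[Attempt]], str],
    build: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 3,  # assumed budget; the abstract does not give one
) -> List[Attempt]:
    """Retry until the build succeeds or the iteration budget runs out."""
    history: List[Attempt] = []
    log = initial_log
    for _ in range(max_iterations):
        # The model sees the latest build log and all earlier outcomes.
        patch = propose_patch(log, history)
        # Build verification returns a success flag and a fresh log.
        succeeded, log = build(patch)
        history.append(Attempt(patch, succeeded, log))
        if succeeded:
            break
    return history


# Hypothetical stubs standing in for an LLM and a real build environment.
def fake_model(log: str, history: List[Attempt]) -> str:
    return f"patch-{len(history)}"


def fake_build(patch: str) -> Tuple[bool, str]:
    if patch == "patch-1":
        return True, "build succeeded"
    return False, "error: missing header"


attempts = repair_loop("initial build failure", fake_model, fake_build, max_iterations=5)
print(len(attempts), attempts[-1].succeeded)
```

Passing the model and build steps as callables keeps the loop independent of any particular model or toolchain, so the stubs can be replaced without touching the control flow.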
