SE · Jun 3

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

arXiv:2606.05238 · 87.2
Predicted impact: top 3% in SE · last 90 days
Originality: Incremental advance
AI Analysis

For researchers developing LLM agents for software engineering, this benchmark reveals a critical gap in autonomous deployment of research artifacts, highlighting the need for better task completion judgment.

DeployBench introduces a benchmark of 51 research-artifact deployment tasks across AI/ML, systems, and scientific computing. Evaluating four LLMs with OpenHands yields pass rates from 7.8% to 51.0%; most failures occur when agents stop prematurely after validating weaker criteria than the task requires.

LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working, runnable environment. For research artifacts released alongside published papers, setting up such an environment from a fresh machine remains a major bottleneck. Existing environment-setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing, covering all these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures are agent-terminated self-stops, where the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.
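
To make the reported figures concrete, the short Python sketch below converts the two pass rates into implied task counts and computes the share of failures that are self-stops. It assumes each model attempted each of the 51 tasks exactly once, which the abstract does not state; the helper name `implied_passes` is illustrative only.

```python
TASKS = 51  # benchmark size stated in the abstract


def implied_passes(rate_percent: float, tasks: int = TASKS) -> int:
    """Nearest whole number of passed tasks for a reported pass rate.

    Assumption: one attempt per task, so the rate is passes / tasks.
    """
    return round(rate_percent / 100 * tasks)


low = implied_passes(7.8)    # weakest model in the reported range
high = implied_passes(51.0)  # strongest model in the reported range

# Share of all reported failures that were agent-terminated self-stops.
self_stop_share = 97 / 154

print(low, high, f"{self_stop_share:.1%}")  # prints: 4 26 63.0%
```

Under that assumption the range corresponds to roughly 4 and 26 tasks passed out of 51, and about 63% of failures are self-stops rather than crashes or timeouts.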
