AI · SE · Apr 28

DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder

Jiaran Zhang, Luck Ma, Fanqi Wan, Di Qi, Xu Zhao, Jieyi Hou, Zhe Xie, Mengqiang Ren, Xin Wu, Zhewei Huang, Liangyu Chen, Qi Han

arXiv:2602.00592 · 98.61 citations · h-index: 10 · Has Code

Predicted impact: top 4% in AI · last 90 days · Originality: Incremental advance

AI Analysis

For software engineering agents, DockSmith addresses the bottleneck of reliable Docker environment construction, enabling scalable execution-grounded training and evaluation.

DockSmith introduces an agentic Docker builder that treats environment construction as a core capability, achieving 39.72% Fail-to-Pass and 58.28% Commit Rate on Multi-Docker-Eval, and improving performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0.

Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not only as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating broader agentic benefits of environment construction.
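To make the Fail-to-Pass figure concrete, here is a minimal sketch of how such a rate could be computed. It assumes the common SWE-bench-style reading: a task counts as Fail-to-Pass when its target tests fail in the built environment before the reference patch and pass after it. The exact criterion used by Multi-Docker-Eval may differ, and the names `TestOutcomes`, `is_fail_to_pass`, and `fail_to_pass_rate` are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical record for one task: test name -> (passed before patch, passed after patch).
TestOutcomes = Dict[str, Tuple[bool, bool]]


def is_fail_to_pass(outcomes: TestOutcomes) -> bool:
    """Return True if the task has at least one target test and every
    target test fails before the patch and passes after it.

    An empty mapping (e.g. the Docker build failed, so no tests ran)
    counts as not Fail-to-Pass. This is an assumption of the sketch.
    """
    return bool(outcomes) and all(
        (not before) and after for before, after in outcomes.values()
    )


def fail_to_pass_rate(tasks: List[TestOutcomes]) -> float:
    """Percentage of tasks that satisfy the Fail-to-Pass criterion."""
    if not tasks:
        return 0.0
    return 100.0 * sum(is_fail_to_pass(t) for t in tasks) / len(tasks)


if __name__ == "__main__":
    example_tasks = [
        {"test_a": (False, True)},                           # fails, then passes
        {"test_a": (True, True)},                            # already passing
        {},                                                  # build failed, no tests ran
        {"test_a": (False, True), "test_b": (False, False)}, # one test still failing
    ]
    print(f"Fail-to-Pass: {fail_to_pass_rate(example_tasks):.2f}%")
```

Under this reading, an environment that fails to build, or one in which the tests pass regardless of the patch, contributes nothing to the rate, which is why the metric rewards correct environments and not just successful builds.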
