CL · SE · Jan 23, 2025

DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

arXiv:2501.13699v1 · 5 citations · h-index: 34 · ACL
Originality: Incremental advance
AI Analysis

The paper addresses dependency inference in automated software development and provides a new benchmark for evaluating LLMs on it. The advance is incremental because it builds on existing studies of runtime errors.

The paper tackles the problem of large language models incorrectly inferring the dependencies of software repositories; dependency-related issues reportedly cause over 40% of observed runtime errors in generated repositories. It introduces DI-BENCH, a benchmark of 581 repositories across four programming languages, and finds that the best model achieves only a 42.9% execution pass rate.

Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, to identify the internal components and external packages required for a repository to run successfully. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
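The abstract mentions textual and execution-based metrics without defining them. As a rough illustration only, the sketch below assumes the textual metric is a set comparison between predicted and reference dependency names (precision, recall, F1), and that the execution pass rate is the fraction of repositories whose tests pass with the inferred dependencies. The function names and the name normalization are hypothetical and may differ from what DI-BENCH actually computes.

```python
def dependency_scores(predicted, reference):
    """Set-based precision, recall, and F1 over dependency names.

    Assumption: names are compared case-insensitively after stripping
    whitespace; version constraints are ignored.
    """
    pred = {name.strip().lower() for name in predicted}
    ref = {name.strip().lower() for name in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def execution_pass_rate(results):
    """Fraction of repositories whose test suite passed.

    `results` is a list of booleans, one per repository (hypothetical
    input format).
    """
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    p, r, f1 = dependency_scores(
        ["requests", "numpy", "flask"],
        ["requests", "numpy", "pandas", "pytest"],
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
    print(f"pass rate={execution_pass_rate([True, False, True, False]):.2f}")
```

Under these assumptions a model can score well on the textual metric and still fail execution, since a single missing or wrong package can break the build; that is one reason to report both kinds of metric.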

