SE LG · Sep 24, 2025

Benchmarking Web API Integration Code Generation

Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Jannis Brugger, Mira Mezini

arXiv:2509.20172v48.01 citations · h-index: 14 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the difficulty of automating API integration for software developers, but its contribution is incremental: it benchmarks existing models without proposing a new solution.

The paper tackles the problem of generating correct web API integration code with large language models (LLMs) and finds that this remains a significant challenge: no evaluated open-source model solves more than 40% of the tasks, with failures that include hallucinated endpoints and incorrect argument usage.

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
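
To make the error categories named in the abstract concrete, the following is a minimal sketch of how a generated API invocation could be checked against a known specification. It is not the WAPIIBench pipeline: the `SPEC` table, its endpoints, the `Invocation` type, and the `classify` function are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field

# Hypothetical miniature API specification, invented for illustration only:
# (HTTP method, path) -> set of argument names the endpoint accepts.
SPEC = {
    ("GET", "/users"): {"limit", "offset"},
    ("POST", "/users"): {"name", "email"},
}


@dataclass
class Invocation:
    """A web API call extracted from generated code (assumed representation)."""
    method: str
    path: str
    args: dict = field(default_factory=dict)


def classify(inv: Invocation) -> str:
    """Label an invocation as correct or with one of two error categories."""
    allowed = SPEC.get((inv.method.upper(), inv.path))
    if allowed is None:
        # The method/path pair does not exist in the specification.
        return "hallucinated_endpoint"
    if set(inv.args) - allowed:
        # The endpoint exists, but at least one argument name is unknown.
        return "incorrect_arguments"
    return "ok"


def solve_rate(invocations: list[Invocation]) -> float:
    """Fraction of invocations classified as correct."""
    if not invocations:
        return 0.0
    return sum(classify(i) == "ok" for i in invocations) / len(invocations)
```

A real evaluation would also need to check argument values, request bodies, and required parameters; this sketch only compares endpoint and argument names.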
