SEAI · Sep 5, 2025

Combining TSL and LLM to Automate REST API Testing: A Comparative Study

arXiv:2509.05540v1 · 4 citations · h-index: 2 · SBES
Originality: Incremental advance
AI Analysis

The paper addresses the high manual effort and limited test coverage that development teams face in REST API testing. The contribution appears incremental, since it applies existing LLMs to a specific domain.

The paper tackles the challenge of automating REST API testing by introducing RestTSLLM, which combines Test Specification Language with Large Language Models to generate test cases from OpenAPI specifications. The results show that Claude 3.5 Sonnet outperformed other LLMs across all metrics, including success rate, test coverage, and mutation score.

The effective execution of tests for REST APIs remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, often resulting in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that uses Test Specification Language (TSL) in conjunction with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on metrics such as success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs - Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), and Sabia 3 (Maritaca) - consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.
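The abstract names success rate and mutation score as comparison metrics but does not define them here. The sketch below shows one common way such metrics could be computed and used to rank models. It assumes that success rate is the fraction of generated tests that pass and that mutation score is the fraction of mutants killed; these definitions, the `ModelRun` record, and the placeholder model names and counts are illustrative assumptions, not the paper's pipeline or data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelRun:
    """Hypothetical record of executing one model's generated test suite."""
    model: str
    tests_total: int      # number of generated tests that were executed
    tests_passed: int     # number of those tests that passed
    mutants_total: int    # number of mutants created from the API under test
    mutants_killed: int   # number of mutants detected by at least one test


def success_rate(run: ModelRun) -> float:
    # Assumed definition: passed tests divided by executed tests.
    return run.tests_passed / run.tests_total if run.tests_total else 0.0


def mutation_score(run: ModelRun) -> float:
    # Standard definition: killed mutants divided by total mutants.
    return run.mutants_killed / run.mutants_total if run.mutants_total else 0.0


def rank_by(runs: List[ModelRun], metric: Callable[[ModelRun], float]) -> List[str]:
    # Return model names ordered from best to worst on the given metric.
    return [r.model for r in sorted(runs, key=metric, reverse=True)]


# Placeholder numbers for illustration only; they are not results from the study.
runs = [
    ModelRun("model-a", tests_total=40, tests_passed=36, mutants_total=50, mutants_killed=30),
    ModelRun("model-b", tests_total=40, tests_passed=30, mutants_total=50, mutants_killed=35),
]

print(rank_by(runs, success_rate))
print(rank_by(runs, mutation_score))
```

With the placeholder counts above, the two metrics rank the models differently, which is why a comparison of this kind reports each metric separately rather than a single aggregate.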

