SE, Apr 8

Assessing REST API Test Generation Strategies with Log Coverage

arXiv:2604.07073
AI Analysis

This work addresses the challenge of evaluating REST API tests for developers and testers working in polyglot environments. Its contribution is incremental: it builds on existing test generation methods and adds new metrics.

The paper tackles the problem of assessing REST API test generation strategies in black-box settings by proposing log coverage metrics and empirically evaluating three strategies on a microservice system. It finds that Claude Opus 4.6 tests uncover, on average, 28.4% more unique log templates than human-written tests, and that combining strategies increases log coverage by up to 105.6%.

Assessing the effectiveness of REST API tests in black-box settings can be challenging due to the lack of access to source code coverage metrics and the polyglot tech stack. We propose three metrics for capturing average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies on the Light-OAuth2 authorization microservice system: evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% relative to the human-written tests alone and by 38.9% relative to the Claude tests alone. When the Locust tests are combined with EvoMaster, the corresponding increases are 30.7% and 76.9%; with GPT-5.2-Codex, they are 26.1% and 105.6%. This means that the generation strategies exercise largely distinct runtime behaviors. Our future work includes extending our study to multiple systems.
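The abstract does not give the exact metric definitions, so the following is only a minimal sketch of how such numbers could be computed. It assumes each test run has already been reduced to the set of unique log template IDs it triggered, that "log coverage" is the count of those templates, and that combined coverage is the union of templates across strategies. The function names `coverage_stats` and `combined_increase` and the template IDs are hypothetical and do not come from the paper.

```python
from statistics import mean


def coverage_stats(runs):
    """Return (average, minimum, maximum) number of unique log templates per run.

    `runs` is a list of iterables of template IDs, one per test run.
    Assumption: coverage is the raw count of distinct templates.
    """
    counts = [len(set(run)) for run in runs]
    return mean(counts), min(counts), max(counts)


def combined_increase(base_runs, other_runs):
    """Percent increase in unique templates when `other_runs` is added to `base_runs`.

    Assumption: each strategy's total coverage is the union over all its runs.
    """
    base = set().union(*base_runs)
    if not base:
        raise ValueError("base strategy observed no log templates")
    combined = base.union(*other_runs)
    return 100.0 * (len(combined) - len(base)) / len(base)


# Hypothetical template IDs for two strategies, two runs each.
human_runs = [{"T1", "T2", "T3"}, {"T1", "T2"}]
llm_runs = [{"T2", "T4", "T5"}, {"T2", "T4", "T5", "T6"}]

print(coverage_stats(human_runs))               # (2.5, 2, 3)
print(combined_increase(human_runs, llm_runs))  # 100.0
print(combined_increase(llm_runs, human_runs))  # 50.0
```

In this toy data the two strategies share only one template, so each gains substantially from the other, which mirrors the kind of complementarity the abstract reports. The paper may aggregate across runs differently, for example per-run averages rather than unions.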

