SE · AI · Apr 20, 2024

Large Language Models as Test Case Generators: Performance Evaluation and Enhancement

arXiv:2404.13340v1 · 18.752 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This paper addresses the need for reliable test case generation in software development. It offers a novel enhancement, but the advance is incremental because it builds on existing LLM capabilities.

The paper studies how well Large Language Models (LLMs) generate test cases. It finds that state-of-the-art LLMs struggle to produce correct test cases as problem difficulty increases, and it proposes a multi-agent framework called TestChain that, with GPT-4 as the backbone, improves test case accuracy by 13.84% over the baseline on the LeetCode-hard dataset.

Code generation with Large Language Models (LLMs) has been extensively studied and achieved remarkable progress. As a complementary aspect to code generation, test case generation is of crucial importance in ensuring the quality and reliability of code. However, using LLMs as test case generators has been much less explored. Current research along this line primarily focuses on enhancing code generation with assistance from test cases generated by LLMs, while the performance of LLMs in test case generation alone has not been comprehensively examined. To bridge this gap, we conduct extensive experiments to study how well LLMs can generate high-quality test cases. We find that as the problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases, largely due to their inherent limitations in computation and reasoning. To mitigate this issue, we further propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs. Notably, TestChain uses a ReAct format conversation chain for LLMs to interact with a Python interpreter in order to provide more accurate test outputs. Our results indicate that TestChain outperforms the baseline by a large margin. Particularly, in terms of the accuracy of test cases, TestChain using GPT-4 as the backbone achieves a 13.84% improvement over the baseline on the LeetCode-hard dataset.
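
The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`design_inputs`, `run_in_interpreter`, `build_test_cases`), the `two_sum` problem, and the hard-coded inputs are all hypothetical stand-ins, and no LLM is called. The only point it illustrates is that expected outputs are obtained by executing code in a Python interpreter rather than by asking a model to predict them.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for code that an LLM agent might write and
# submit to the interpreter while working out the expected output.
SNIPPET = '''
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []
'''


def design_inputs() -> List[Tuple[Any, ...]]:
    """Stand-in for the input-generating step (an LLM agent in the paper).

    Here the inputs are simply hard-coded for illustration.
    """
    return [([2, 7, 11, 15], 9), ([3, 3], 6), ([1, 2, 3], 7)]


def run_in_interpreter(snippet: str, func_name: str, args: Tuple[Any, ...]) -> Any:
    """Stand-in for the interpreter step: execute code instead of guessing.

    Assumes the snippet is trusted; a real system would need sandboxing.
    """
    namespace: Dict[str, Any] = {}
    exec(snippet, namespace)
    return namespace[func_name](*args)


def build_test_cases() -> List[Tuple[Tuple[Any, ...], Any]]:
    """Pair each designed input with an output computed by the interpreter."""
    return [
        (args, run_in_interpreter(SNIPPET, "two_sum", args))
        for args in design_inputs()
    ]


cases = build_test_cases()
for args, expected in cases:
    print(f"assert two_sum{args} == {expected}")
```

Keeping input design and output computation in separate functions mirrors the separation the abstract describes; in the paper's setting each step would be driven by an LLM conversation rather than fixed code.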
