74.4 | SE | May 26
LLM-based Mockless Unit Test Generation for Java
Qinghua Xu, Guancheng Wang, Lionel Briand et al.
Large language models (LLMs) have shown strong potential for automated test generation, yet most approaches to generating Java unit tests still rely on mocking frameworks to handle dependencies. Mockless test generation could exercise more real low-level code, but it faces challenges such as invalid test code generation due to hallucination, strict language constraints, and inadequate dependency awareness. We identify two causes behind these hallucinations: not knowing, where the LLM lacks sufficient context, and not following, where the LLM fails to comply with constraints even when they are provided. We present MocklessTester, a mockless unit test generation approach built around two strategies: context-enriched generation and constraint-enforced fixing. To mitigate not knowing, context-enriched generation mines real usage patterns from existing code to generate tests. To mitigate not following, constraint-enforced fixing performs two-stage repair under symbol-, protocol-, and iteration-level constraints, using a ClassIndex, a Markov typestate model, and experience memory. We evaluate MocklessTester against the state-of-the-art baseline on Defects4J and Deps4J. Results show that MocklessTester improves line coverage by 19.99% and 22.69% and branch coverage by 24.90% and 15.78% on the two benchmarks, respectively, and improves mutation score by 13.67% and 0.17%. Beyond the class under test, MocklessTester also exercises more real dependency code, covering 378 and 55 additional lines in dependency classes, respectively. The improvement in test quality comes with higher total token and time costs than the baseline. Nevertheless, the cost per method remains practical, averaging 108.97 seconds and 26.59k tokens on Defects4J, and 69.85 seconds and 25.46k tokens on Deps4J. Ablation results confirm that all major components contribute positively to the final performance.
65.5 | SE | Apr 23
Call-Chain-Aware LLM-Based Test Generation for Java Projects
Guancheng Wang, Qinghua Xu, Lionel C. Briand et al.
Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller-callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
SE | Jun 18, 2020 | Code
Prioritizing documentation effort: Can we do better?
Shiran Liu, Zhaoqiang Guo, Yanhui Li et al.
Code documentation is essential for software quality assurance, but due to time or economic pressures, developers are often unable to write documentation for all modules in a project. Recently, a supervised artificial neural network (ANN) approach was proposed to prioritize important modules for documentation effort. However, as a supervised approach, it requires labeled training data to train the prediction model, which may not be easy to obtain in practice. Furthermore, it is unclear whether the ANN approach is generalizable, as it has only been evaluated on several small data sets. In this paper, we propose an unsupervised approach based on PageRank to prioritize documentation effort. This approach identifies "important" modules based only on the dependence relationships between modules in a project. As a result, the PageRank approach does not need any training data to build the prediction model. To evaluate the effectiveness of the PageRank approach, we conduct experiments on six additional large data sets, in addition to the data sets collected from open-source projects that were used in prior studies. The experimental results show that the PageRank approach is superior to the state-of-the-art ANN approach in prioritizing important modules for documentation effort. In particular, due to its simplicity and effectiveness, we advocate that the PageRank approach should be used as an easy-to-implement baseline in future research on documentation effort prioritization, and any new approach should be compared with it to demonstrate its effectiveness.
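The idea of ranking modules purely from their dependence relationships can be illustrated with a minimal sketch. It is not the authors' implementation: the edge direction (a module passes rank to the modules it depends on), the damping factor of 0.85, the uniform handling of modules without outgoing dependencies, and the module names are all assumptions made for illustration.

```python
def pagerank(deps, damping=0.85, iterations=100, tol=1e-10):
    """Plain power-iteration PageRank over a module dependence graph.

    deps maps a module name to the list of modules it depends on.
    Assumption of this sketch: rank flows from a module to its dependencies,
    so heavily depended-upon modules end up with the highest scores.
    """
    nodes = sorted(set(deps) | {d for targets in deps.values() for d in targets})
    n = len(nodes)
    rank = {m: 1.0 / n for m in nodes}
    for _ in range(iterations):
        # Modules with no outgoing dependencies spread their rank uniformly.
        dangling = sum(rank[m] for m in nodes if not deps.get(m))
        new_rank = {m: (1 - damping) / n + damping * dangling / n for m in nodes}
        for src in nodes:
            targets = sorted(set(deps.get(src, [])))
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        delta = sum(abs(new_rank[m] - rank[m]) for m in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def prioritize(deps):
    """Return module names ordered from most to least important."""
    rank = pagerank(deps)
    return sorted(rank, key=lambda m: (-rank[m], m))


# Hypothetical project: Main depends on Parser and Lexer, both depend on Util.
example_deps = {
    "Main": ["Parser", "Lexer"],
    "Parser": ["Util"],
    "Lexer": ["Util"],
    "Util": [],
}
print(prioritize(example_deps))
```

No labeled data enters the computation at any point; the only input is the dependence graph, which is what makes such an approach unsupervised.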
SE | Oct 29, 2019 | Code
MAT: A simple yet strong baseline for identifying self-admitted technical debt
Zhaoqiang Guo, Shiran Liu, Jinping Liu et al.
In the process of software evolution, developers often sacrifice long-term code quality to satisfy short-term goals, which is called technical debt. In particular, self-admitted technical debt (SATD) refers to debt that was intentionally introduced and flagged in code comments. Such technical debt reduces the quality of software and increases the cost of subsequent software maintenance. Therefore, it is necessary to find and resolve this debt in time. Recently, many approaches have been proposed to identify SATD. However, those approaches either have low accuracy or are complex to implement in practice. In this paper, we propose a simple unsupervised baseline approach that fuzzily matches task annotation tags (MAT) to identify SATD. MAT does not need any training data to build a prediction model. Instead, MAT only examines whether any of four task tags (i.e., TODO, FIXME, XXX, and HACK) appears in the comments of a target project to identify SATD. In this sense, MAT is a natural, easy-to-understand baseline approach for SATD identification. To evaluate the usefulness of MAT, we conduct an experiment on 10 open-source projects. The experimental results reveal that MAT performs surprisingly well at SATD identification compared with the state-of-the-art approaches. As such, we suggest that, in future SATD identification studies, MAT should be considered as an easy-to-implement baseline against which any new approach should be compared to demonstrate its usefulness.
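A tag-matching check of this kind fits in a few lines. The sketch below is illustrative only: the four tags come from the abstract, but the exact fuzzy-matching rule (case-insensitive tokens that start or end with TODO, FIXME, or HACK, and an exact match for XXX) is an assumption of this example and may differ from the rule used in the paper.

```python
import re

# The four task tags named in the abstract.
PREFIX_SUFFIX_TAGS = ("todo", "fixme", "hack")  # matched fuzzily (assumed rule)
EXACT_TAGS = ("xxx",)                            # matched exactly (assumed rule)


def is_satd(comment):
    """Return True if the comment contains a token matching a task tag."""
    # Lowercase and split on anything that is not a letter.
    tokens = [t for t in re.split(r"[^a-z]+", comment.lower()) if t]
    for token in tokens:
        if token in EXACT_TAGS:
            return True
        if any(token.startswith(tag) or token.endswith(tag)
               for tag in PREFIX_SUFFIX_TAGS):
            return True
    return False


# Hypothetical comments for illustration.
for text in ["// TODO: refactor this", "/* returns the index */", "# XXX fragile"]:
    print(text, "->", is_satd(text))
```

A rule this loose will also flag words such as "hackathon", which is one reason the matching rule is worth stating explicitly when MAT is used as a baseline.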
SE | Jan 27, 2021
An extensive empirical study of inconsistent labels in multi-version-project defect data sets
Shiran Liu, Zhaoqiang Guo, Yanhui Li et al.
The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detect instances with inconsistent labels (i.e., the phenomenon of instances having the same source code but different labels over multiple versions of a software project) and to understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels to varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model and can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. The above findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. Moreover, researchers need to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies that used multi-version-project defect data sets with a high ratio of inconsistent labels.
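The definition of an inconsistent label (same source code, different labels across versions) suggests a simple grouping check, sketched below. This is not the authors' detection procedure: the record layout, the use of the module name plus a hash of the raw source text as the identity of an instance, and the example data are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def find_inconsistent(instances):
    """Return instances whose source is identical across versions
    but whose defect labels disagree.

    Each instance is a dict with the keys "version", "module",
    "source", and "label" (a layout assumed for this sketch).
    """
    groups = defaultdict(list)
    for inst in instances:
        digest = hashlib.sha256(inst["source"].encode("utf-8")).hexdigest()
        groups[(inst["module"], digest)].append(inst)

    flagged = []
    for group in groups.values():
        # More than one distinct label for identical code is inconsistent.
        if len({inst["label"] for inst in group}) > 1:
            flagged.extend(group)
    return flagged


# Hypothetical data: A.java is unchanged between 1.0 and 1.1 but relabeled.
data = [
    {"version": "1.0", "module": "A.java", "source": "int f() { return 1; }", "label": "buggy"},
    {"version": "1.1", "module": "A.java", "source": "int f() { return 1; }", "label": "clean"},
    {"version": "1.2", "module": "A.java", "source": "int f() { return 2; }", "label": "clean"},
    {"version": "1.0", "module": "B.java", "source": "void g() {}", "label": "clean"},
    {"version": "1.1", "module": "B.java", "source": "void g() {}", "label": "clean"},
]
for inst in find_inconsistent(data):
    print(inst["module"], inst["version"], inst["label"])
```

Hashing the raw text treats any whitespace or comment change as a code change; a practical detector would likely normalize the source before comparing it.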